# BDA Assignment #2: MapReduce
## Data Preprocessing 


#### Group members: 
Aaqib Ahmed Nazir (i22-1920), 
Arhum Khan (i22-1967), 
Ammar Khasif (i22-1968)

##### Section: DS-D

#### Libraries Used:

In [1]:
import pandas as pd
from nltk.corpus import stopwords
import string
import numpy as np
import os 


### Preprocessing Steps
##### >Splitting the dataset into chunks of 10,000 rows each and saving them as separate files.
##### >Applying the following preprocessing steps to each chunk: lowercasing, punctuation removal, and stop-word removal.

In [2]:
# Loading NLTK stop words (needs a one-time nltk.download("stopwords") if the corpus is missing)
stop_words = set(stopwords.words("english"))


# Function to preprocess text and remove stopwords
def preprocess_and_remove_stopwords(text):
    if isinstance(text, str):
        text = text.lower()
        # Removing punctuation
        text = text.translate(str.maketrans("", "", string.punctuation))
        word_tokens = text.split()
        # Removing stopwords and joining the words back together
        filtered_text = " ".join(
            [word for word in word_tokens if word not in stop_words]
        )
        return filtered_text
    else:
        return np.nan


chunk_size = 10000

# Reading the dataset in chunks
chunk_index = 1
for chunk in pd.read_csv("Dataset.csv", chunksize=chunk_size):
    # Preprocess
    chunk["SECTION_TEXT"] = chunk["SECTION_TEXT"].apply(preprocess_and_remove_stopwords)
    # Drop rows with missing values
    chunk.dropna(subset=["SECTION_TEXT"], inplace=True)
    chunk.to_csv(f"mini_dataset_{chunk_index}.csv", index=False)
    chunk_index += 1
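
To see what the preprocessing function does to a single string, here is a minimal, self-contained sketch. It assumes a tiny hand-picked stop-word set (`DEMO_STOP_WORDS`, hypothetical) in place of the NLTK list so that it runs without downloading any corpus; the cleaning logic mirrors `preprocess_and_remove_stopwords` above, and `demo_preprocess` is an illustrative name only.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "is", "a", "of", "and", "that"}


def demo_preprocess(text):
    """Lowercase, strip punctuation, and drop stop words from a string."""
    if not isinstance(text, str):
        # The notebook returns np.nan here so the row can be dropped later
        return None
    text = text.lower()
    # Delete every character listed in string.punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)


sample = "Anarchism is a political philosophy that advocates self-governed societies."
cleaned = demo_preprocess(sample)
print(cleaned)
```

Note that deleting punctuation merges hyphenated words ("self-governed" becomes "selfgoverned"), which is consistent with the tokens visible in the notebook's output.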
    

### Combining the mini datasets into one dataset

In [3]:
# Load each preprocessed chunk back into a list of DataFrames
preprocessed_dfs = []
for i in range(1, chunk_index):
    preprocessed_dfs.append(pd.read_csv(f'mini_dataset_{i}.csv'))
    

### Saving the combined dataset

In [4]:
# Combine all preprocessed chunks into a single DataFrame
combined_df = pd.concat(preprocessed_dfs, ignore_index=True)

# Dropping all rows that contain null values
combined_df.dropna(inplace=True)
print(combined_df.isnull().sum())
print(combined_df.shape)

combined_df.to_csv('preprocessed_dataset.csv', index=False)


ARTICLE_ID       0
TITLE            0
SECTION_TITLE    0
SECTION_TEXT     0
dtype: int64
(4194808, 4)
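
Loading every chunk into a list before concatenating keeps the whole dataset in memory at once. As an alternative sketch (not the approach used in this notebook), the chunk files can be appended to one output file row by row using only the standard library. The function name `concat_csv_files` and the demo file names are assumptions made for illustration.

```python
import csv
import os
import tempfile


def concat_csv_files(paths, out_path):
    """Append the rows of several CSV files into one, writing the header once."""
    rows_written = 0
    header_written = False
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip completely empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demonstration with two small throwaway chunk files
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_paths = []
    demo_chunks = [[["0", "alpha"]], [["1", "beta"], ["2", "gamma"]]]
    for i, rows in enumerate(demo_chunks, start=1):
        path = os.path.join(tmp_dir, f"demo_chunk_{i}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            chunk_writer = csv.writer(f)
            chunk_writer.writerow(["ARTICLE_ID", "SECTION_TEXT"])
            chunk_writer.writerows(rows)
        demo_paths.append(path)

    combined_path = os.path.join(tmp_dir, "demo_combined.csv")
    total_rows = concat_csv_files(demo_paths, combined_path)
    with open(combined_path, newline="", encoding="utf-8") as f:
        combined_rows = list(csv.reader(f))

print(total_rows)
```

This trades the convenience of `pd.concat` for a constant memory footprint; null filtering would still need to happen per chunk before the files are written.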


#### Printing the first 5 rows of the combined dataset

In [6]:
display(combined_df.head(5))
# Use positional access so this works even if index label 0 was dropped
print(combined_df['SECTION_TEXT'].iloc[0])

| | ARTICLE_ID | TITLE | SECTION_TITLE | SECTION_TEXT |
|---|---|---|---|---|
| 0 | 0 | Anarchism | Introduction | anarchism political philosophy advocates selfg... |
| 1 | 0 | Anarchism | Etymology and terminology | term anarchism compound word composed word ana... |
| 2 | 0 | Anarchism | History | origins woodcut diggers document william evera... |
| 3 | 0 | Anarchism | Anarchist schools of thought | portrait philosopher pierrejoseph proudhon 180... |
| 4 | 0 | Anarchism | Internal issues and debates | consistent anarchist values controversial subj... |


anarchism political philosophy advocates selfgoverned societies based voluntary institutions often described stateless societies although several authors defined specifically institutions based nonhierarchical free associations anarchism holds state undesirable unnecessary harmful antistatism central anarchism specifically entails opposing authority hierarchical organisation conduct human relations including limited state system anarchism usually considered extreme leftwing ideology much anarchist economics anarchist legal philosophy reflects antiauthoritarian interpretations communism collectivism syndicalism mutualism participatory economics anarchism offer fixed body doctrine single particular world view instead fluxing flowing philosophy many types traditions anarchism exist mutually exclusive anarchist schools thought differ fundamentally supporting anything extreme individualism complete collectivism strains anarchism often divided categories social individualist anarchism simila

### Deleting the mini datasets to clear up space

In [10]:
# chunk_index is one past the last file written, so the range stops before it
for i in range(1, chunk_index):
    os.remove(f"mini_dataset_{i}.csv")
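
Because the loop above depends on `chunk_index` still being in memory and on the range bounds matching the files that actually exist, a pattern-based cleanup can be more forgiving. The following is a sketch under the assumption that every temporary file matches `mini_dataset_*.csv`; the helper name `remove_mini_datasets` is hypothetical.

```python
import tempfile
from pathlib import Path


def remove_mini_datasets(directory="."):
    """Delete every mini_dataset_*.csv file in `directory`; return how many were removed."""
    removed = 0
    # Materialise the matches first so files are not deleted mid-iteration
    for path in list(Path(directory).glob("mini_dataset_*.csv")):
        path.unlink()
        removed += 1
    return removed


# Demonstration in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    for i in range(1, 4):
        (Path(tmp_dir) / f"mini_dataset_{i}.csv").write_text("ARTICLE_ID\n0\n")
    (Path(tmp_dir) / "preprocessed_dataset.csv").write_text("ARTICLE_ID\n0\n")

    removed_count = remove_mini_datasets(tmp_dir)
    remaining = sorted(p.name for p in Path(tmp_dir).iterdir())

print(removed_count, remaining)
```

Only files matching the pattern are removed, so the combined `preprocessed_dataset.csv` is left in place.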
    