# BDA Assignment #2: MapReduce
## Data Preprocessing 


#### Group members: 
Aaqib Ahmed Nazir (i22-1920), 
Arhum Khan (i22-1967), 
Ammar Khasif (i22-1968)

##### Section: DS-D

#### Libraries Used:

In [1]:
import os 
import string
import numpy as np
import pandas as pd
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import os 

### Preprocessing Steps
##### >Splitting the dataset into chunks of 1000 rows each and saving them as separate files.
##### >Applying the following preprocessing steps on each chunk

In [2]:
# Loading NLTK stop words
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Function to preprocess text, remove stopwords, and lemmatize English words
def preprocess_remove_stopwords_and_lemmatize(text):
    if isinstance(text, str):
        text = text.lower()
        # Removing punctuation
        text = text.translate(str.maketrans("", "", string.punctuation))
        word_tokens = word_tokenize(text)
        # Lemmatizing English words, removing stopwords, and joining the words back together
        filtered_text = " ".join([
            lemmatizer.lemmatize(word) if word not in stop_words else word for word in word_tokens
        ])
        return filtered_text
    else:
        return np.nan

chunk_size = 10000

# Reading the dataset in chunks
chunk_index = 1
for chunk in pd.read_csv("Dataset.csv", chunksize=chunk_size):
    # Preprocess, remove stopwords, and lemmatize
    chunk["SECTION_TEXT"] = chunk["SECTION_TEXT"].apply(preprocess_remove_stopwords_and_lemmatize)
for chunk in pd.read_csv("data.csv", chunksize=chunk_size):
    # Preprocess
    chunk["SECTION_TEXT"] = chunk["SECTION_TEXT"].apply(preprocess_and_remove_stopwords)
    # Drop rows with missing values
    chunk.dropna(subset=["SECTION_TEXT"], inplace=True)
    chunk.to_csv(f"mini_dataset_{chunk_index}.csv", index=False)
    chunk_index += 1


### Combining the mini datasets into one dataset

In [3]:
# Combine all preprocessed chunks into a single DataFrame
preprocessed_dfs = []
for i in range(1, chunk_index):
    preprocessed_dfs.append(pd.read_csv(f'mini_dataset_{i}.csv'))


### Dropping unnecessary columns and saving the dataset

In [4]:
# Combine all preprocessed chunks into a single DataFrame
combined_df = pd.concat(preprocessed_dfs, ignore_index=True)

# dropping all the null values
combined_df.dropna(inplace=True)
print(combined_df.isnull().sum())
print(combined_df.shape)


ARTICLE_ID       0
TITLE            0
SECTION_TITLE    0
SECTION_TEXT     0
dtype: int64
(4194966, 4)


In [5]:
combined_df.to_csv('preprocessed_dataset.csv', index=False)

In [6]:
combined_df = pd.read_csv("C:\\Users\\Arhum Khan\\Desktop\\preprocessed_dataset.csv")


# combined_df = pd.read_csv("combined_df.csv")
combined_df.columns = [None] * len(combined_df.columns)

# Display the DataFrame
print(combined_df.head())

# save the preprocessed data
combined_df.to_csv("combined_df.csv", index=False)

  None       None                          None  \
0    0  Anarchism                  Introduction   
1    0  Anarchism     Etymology and terminology   
2    0  Anarchism                       History   
3    0  Anarchism  Anarchist schools of thought   
4    0  Anarchism   Internal issues and debates   

                                                None  
0  anarchism political philosophy advocates selfg...  
1  term anarchism compound word composed word ana...  
2  zzorigins woodcut diggers document william eve...  
3  portrait philosopher pierrejoseph proudhon 180...  
4  consistent anarchist values controversial subj...  


#### Printing the first 5 rows of the combined dataset

In [6]:
display(combined_df.head(5))
print(combined_df['SECTION_TEXT'][0])

Unnamed: 0,ARTICLE_ID,TITLE,SECTION_TITLE,SECTION_TEXT
0,0,Anarchism,Introduction,anarchism is a political philosophy that advoc...
1,0,Anarchism,Etymology and terminology,the term anarchism is a compound word composed...
2,0,Anarchism,History,origin woodcut from a digger document by willi...
3,0,Anarchism,Anarchist schools of thought,portrait of philosopher pierrejoseph proudhon ...
4,0,Anarchism,Internal issues and debates,consistent with anarchist value is a controver...


anarchism is a political philosophy that advocate selfgoverned society based on voluntary institution these are often described as stateless society although several author have defined them more specifically as institution based on nonhierarchical free association anarchism hold the state to be undesirable unnecessary and harmful while antistatism is central anarchism specifically entail opposing authority or hierarchical organisation in the conduct of all human relation including but not limited to the state system anarchism is usually considered an extreme leftwing ideology and much of anarchist economics and anarchist legal philosophy reflects antiauthoritarian interpretation of communism collectivism syndicalism mutualism or participatory economics anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy many type and tradition of anarchism exist not all of which are mutually exclusive anarchist school of tho

### Deleting the mini datasets to clear up space

In [7]:
for i in range(1, chunk_index + 1):
    os.remove(f"mini_dataset_{i}.csv")
