# Task 1: Acquire and preprocess the data
* 20 news group dataset from _sklearn_
* IMDB Reviews: located at the root of this directory, because GCS is too slow for this many files.

## some notes:
* look at the `spacy` package for preprocessing? we might be able to extract adjectives and give them more weight than the rest of the words.
* https://www.machinelearningplus.com/nlp/gensim-tutorial/
* look at _n-grams_ (2 or 3 maybe useful)
    * "very" might come up often in TFIDF and might be ignored. But if we look at 2-grams, "very_good" and "very_bad" give good insight into the rating of the movie, and they should rank higher in TFIDF
* any processing that we want to do should be done in the `PreProcessor` class so that it can be applied to both datasets
* maybe look at the phonemes of words?
* fixing typos in reviews
    * put words in their singular and masculine form

spaCy and NLTK ship different stopword lists --> they are output in the dataset
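
The 2-gram idea from the notes can be sketched in a few lines. This is only an illustration: `known_bigrams` is a hypothetical whitelist standing in for the real bigram list, `tokens_with_bigrams` is a made-up helper name, and plain `zip` is used instead of `nltk.util.ngrams` so the snippet has no dependencies.

```python
import re

# hypothetical whitelist standing in for the bigram set loaded from dataset/bigrams.html
known_bigrams = {"very_good", "very_bad"}

def tokens_with_bigrams(text, whitelist):
    """Tokenize on word characters, then append every adjacent pair found in the whitelist."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + [b for b in bigrams if b in whitelist]

example_tokens = tokens_with_bigrams("The plot was very good", known_bigrams)
print(example_tokens)
```

The unigrams are kept alongside the whitelisted bigrams, so "very" still counts on its own while "very_good" becomes a separate, more informative token.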

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

dataset_path = Path('dataset')
dataset_path.mkdir(exist_ok=True, parents=True)

nltk.download('stopwords')

bigram_series = pd.read_html('dataset/bigrams.html')[1]
bigram_series[0] = bigram_series[0].astype(str)
bigram_series[0] = bigram_series[0].apply(lambda x : x.replace(u'\xa0', "_"))

# make it a set because O(1) check vs O(n) check
bigram_series = set(bigram_series[0].to_list())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
class PreProcessor:
    
    def __init__(self, df:pd.DataFrame, words_to_remove:list, f_name:str, process_column:str='sentence',token:str=r'\w+'):
        """df['sentence'] needs to be string """
        self.df = df
        self.stop_words = set(stopwords.words('english')+words_to_remove)
        self.f_name = f_name
        self.process_column = process_column
        self.tokenizer = RegexpTokenizer(token)
        self.__did_process = False
        self.exp_df = None
    
    def process(self) -> (pd.DataFrame, pd.DataFrame):
        """df['sentence'] is an array of tokenized words based on processing settings
        """
        if self.__did_process:
            return self.df, self.exp_df
        
        ###################    cleaning    ###########################
        # add any other steps that you need. it will be applied to all dataset

        self.df[self.process_column] = self.df[self.process_column].astype(str)

        # replace \n and \t with a space and drop \r
        self.df[self.process_column] = self.df[self.process_column].apply(lambda x: x.replace('\n', ' ').replace('\r', '').replace('\t', ' '))

        # tokenize the strings into array. makes them lowercase too
        self.df[self.process_column] = self.df[self.process_column].apply(lambda x: self.tokenizer.tokenize(x.lower()))

        # compute 2 ngrams and check if they are in the list of ngrams
        self.df[self.process_column] = self.df[self.process_column].apply(lambda x : x+['_'.join(i) for i in list(ngrams(x, 2)) if '_'.join(i) in bigram_series])
        
        # n-grams
        # methodology: generate all n-grams, then keep those that appear in this list: http://phrasesinenglish.org/explorengrams.html#filterdiv
        # https://albertauyeung.github.io/2018/06/03/generating-ngrams.html
        # 'very' 'good'
        # 'very_good' --> 2-gram, pos
        # 'very_bad' --> neg


        # remove common stopwords from the NLTK library and also some domain-specific words
        self.df[self.process_column] = self.df[self.process_column].apply(lambda x : [item for item in x if item not in self.stop_words])

        #################################################################

        self.df.to_csv(dataset_path.joinpath('{}_row_array_bigram.csv'.format(self.f_name)), index=False)

        # see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html
        self.exp_df = self.df.explode(self.process_column)
        self.exp_df.reset_index(inplace=True, drop=True)
        self.exp_df.to_csv(dataset_path.joinpath('{}_exploded_bigram.csv'.format(self.f_name)), index=False)


        self.__did_process = True
        
        return self.df, self.exp_df
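
The difference between the two frames returned by `process()` is easiest to see on a toy example. The sketch below is an assumption-laden illustration, not the class itself: `toy_row_df` is a hypothetical two-review frame that mimics the "row array" output, and `explode` turns it into the one-token-per-row layout.

```python
import pandas as pd

# toy frame mimicking the "row array" output: one list of tokens per review
toy_row_df = pd.DataFrame({
    'review_id': [0, 1],
    'sentence': [['great', 'film'], ['bad']],
})

# explode() emits one row per token and repeats the other columns
toy_exploded_df = toy_row_df.explode('sentence').reset_index(drop=True)
print(toy_exploded_df)
```

The exploded layout is what makes per-token counting with `value_counts()` straightforward.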
    
    

## Process IMDB Review
* core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets
* no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings
* a negative review has a score less than or equal to 4 out of 10
* a positive review has a score greater than or equal to 7 out of 10
* neutral ratings are not included in the train/test sets
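
The labelling rule in the list above can be written down directly. This is a minimal sketch; `label_from_rating` is a hypothetical helper name, not something defined in the dataset or in this notebook.

```python
def label_from_rating(rating):
    """Map a 1-10 rating to 'neg', 'pos', or None for neutral ratings."""
    if rating <= 4:
        return 'neg'
    if rating >= 7:
        return 'pos'
    return None  # 5 and 6 are neutral and not part of the train/test sets

labels = {rating: label_from_rating(rating) for rating in range(1, 11)}
print(labels)
```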

Within CSV:
* _review_id_ : id of review
* _train_or_test_ : test or train, from dataset
* _review_type_ : pos or neg
* _review_number_ : rating given (out of 10)
* _sentence_ : review text
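
Most of these columns come straight from the file path. The sketch below assumes the layout `aclImdb/<train|test>/<pos|neg>/<review_id>_<rating>.txt`; `record_from_path` and the example file name are hypothetical, and the id and rating are cast to `int` here purely for illustration.

```python
from pathlib import PurePosixPath

def record_from_path(file_path):
    # assumed layout: aclImdb/<train|test>/<pos|neg>/<review_id>_<rating>.txt
    path = PurePosixPath(file_path)
    review_id, rating = path.stem.split('_')
    _, train_or_test, review_type = path.parent.parts
    return {
        'review_id': int(review_id),
        'train_or_test': train_or_test,
        'review_type': review_type,
        'review_number': int(rating),
    }

# hypothetical file name, used only for illustration
example_record = record_from_path('aclImdb/train/pos/0_9.txt')
print(example_record)
```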

In [None]:
# no need to run this anymore, everything is already parsed

# imdb_train_neg_path = Path('aclImdb','train', 'neg').glob('**/*')
# imdb_train_pos_path = Path('aclImdb','train', 'pos').glob('**/*')

# imdb_train_files = [x for x in imdb_train_neg_path if x.is_file()] + [x for x in imdb_train_pos_path if x.is_file()]

# imdb_test_neg_path = Path('aclImdb','test', 'neg').glob('**/*')
# imdb_test_pos_path = Path('aclImdb','test', 'pos').glob('**/*')

# imdb_test_files = [x for x in imdb_test_neg_path if x.is_file()] + [x for x in imdb_test_pos_path if x.is_file()]

## Convert text file to CSV

We put everything in one CSV for now because it makes the cleaning easier. Afterwards we can simply use pandas to split the dataset back into its original parts.

```
df[df['train_or_test'] == 'test']
df[(df['train_or_test'] == 'test') & (df['review_type'] == 'pos')]
```

The arrays are converted into strings when saving to a CSV. To convert them back, run the following:
```
from ast import literal_eval
df = pd.read_csv('imdb_no_punctuation.csv')
df['sentence'] = df['sentence'].apply(lambda s:literal_eval(s))
```
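
To see why the conversion is needed, here is a small round trip through an in-memory buffer. It is a sketch under assumptions: `toy_df` is made-up data and `StringIO` stands in for a CSV file on disk.

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

toy_df = pd.DataFrame({'sentence': [['very', 'good', 'very_good'], ['bad']]})

# in-memory buffer standing in for a CSV file on disk
buffer = StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

restored_df = pd.read_csv(buffer)
# at this point each cell is a plain string, not a list
print(type(restored_df['sentence'][0]))

# literal_eval parses the string representation back into a Python list
restored_df['sentence'] = restored_df['sentence'].apply(literal_eval)
print(restored_df['sentence'][0])
```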

In [None]:
def convert_imbd_to_csv(file_lst, output_name):
    columns = ['review_id', 'train_or_test', 'review_type', 'review_number', 'sentence']
    rows = []

    for file in file_lst:
        # file names look like <review_id>_<rating>.txt
        detail = file.stem.split('_')
        # relative path is aclImdb/<train|test>/<pos|neg>/<file>
        path = file.parent.parts
        with open(file, 'r') as f:
            rows.append({
                'train_or_test': path[1], 'review_type': path[2], 'sentence': f.read(),
                'review_number': detail[1], 'review_id': detail[0]
                })

    # build the frame once: DataFrame.append in a loop is slow and was removed in pandas 2.0
    pd.DataFrame(rows, columns=columns).to_csv(output_name, index=False)

# don't run this, it takes about 10 min
# convert_imbd_to_csv(imdb_test_files+imdb_train_files, dataset_path.joinpath('imdb_raw.csv'))

In [None]:
imdb_raw_df = pd.read_csv(dataset_path.joinpath('imdb_raw.csv'))
"""
list of words that are common to both datasets
    > we can play with which words to remove and see how the model performs
    br is an HTML tag
"""
common_words = [
    'br', 
    # 'film', 
    # 'movie', 
    # 'one', 
    # 'like', 
    # 'good',
    # 'time'
    ]
imdb_processor = PreProcessor(imdb_raw_df, common_words,'imdb')

In [None]:
imdb_row_df,imdb_exploded_df = imdb_processor.process()

### Analysis

In [None]:
grouped = imdb_exploded_df[imdb_exploded_df['review_type'] == 'pos']['sentence'].value_counts()
print(grouped[ grouped == 1].size)

# we can fine-tune this threshold: the max count is around 60k, so maybe drop tokens that appear 500 times or fewer? the more we can shrink the vocabulary, the faster our model will be and the better the results
grouped[ grouped <= 500].size

36796


145733

In [None]:
imdb_exploded_df[imdb_exploded_df['review_type'] == 'pos']['sentence'].value_counts()

film            42110
of_the          41598
movie           37854
one             27320
in_the          25738
                ...  
fewer_people        1
tvms                1
douchewads          1
blissed             1
dispair             1
Name: sentence, Length: 147528, dtype: int64
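
The long tail of tokens that appear only once is a natural target for pruning. The sketch below shows one way to apply a frequency cut-off; `toy_tokens` and `min_count` are hypothetical, and the right threshold for the real data still has to be found by experiment.

```python
from collections import Counter

# toy token stream standing in for the exploded 'sentence' column
toy_tokens = ['film', 'film', 'film', 'movie', 'movie', 'tvms', 'dispair']

min_count = 2  # hypothetical cut-off; tune it against model performance
counts = Counter(toy_tokens)

# keep tokens that reach the cut-off, and list the ones that would be dropped
kept_vocab = {token for token, n in counts.items() if n >= min_count}
rare_tokens = sorted(token for token, n in counts.items() if n < min_count)

print(kept_vocab)
print(rare_tokens)
```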

## Process 20 news group dataset from _sklearn_

Within CSV:
* _sentence_ : from dataset
* _target_ : from dataset
* _train_or_test_ : train or test instance, from dataset

The text returned by _fetch_20newsgroups_ is pretty messy, so the CSVs are tab-separated; make sure to pass `sep='\t'` when you read them.
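
A quick round trip shows the tab-separated convention in action. This is a sketch with made-up data: `toy_news_df` is hypothetical and `StringIO` stands in for the file on disk.

```python
from io import StringIO

import pandas as pd

toy_news_df = pd.DataFrame({
    'target': [3],
    'sentence': ['Well, this text, has commas'],
})

buffer = StringIO()  # in-memory stand-in for the tab-separated file
toy_news_df.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)

# the same separator must be passed when reading the file back
restored_news_df = pd.read_csv(buffer, sep='\t')
print(restored_news_df.loc[0, 'sentence'])
```

This only works because tabs inside the text are replaced with spaces before saving, so a tab can never appear inside a field.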

In [None]:
from sklearn.datasets import fetch_20newsgroups
twenty_news_group_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
twenty_news_group_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [None]:
twenty_news_train_df = pd.DataFrame(data={'target': twenty_news_group_train.target, 'train_or_test':'train', 'sentence': twenty_news_group_train.data})
twenty_news_test_df = pd.DataFrame(data={'target': twenty_news_group_test.target, 'train_or_test':'test', 'sentence': twenty_news_group_test.data})

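# DataFrame.append was removed in pandas 2.0; pd.concat keeps each frame's own index, as append did
twenty_news_combined_df = pd.concat([twenty_news_train_df, twenty_news_test_df])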

twenty_news_combined_df['sentence'] = twenty_news_combined_df['sentence'].apply(lambda x: x.replace('\n', ' ').replace('\r', '').replace('\t', ' ').strip())
twenty_news_combined_df.reset_index(inplace=True)
twenty_news_combined_df.rename(columns={'index': 'id'}, inplace=True)
twenty_news_combined_df.to_csv(dataset_path.joinpath('twenty_news_raw.csv'), sep='\t', index=False)

In [None]:
twenty_news_raw_df = pd.read_csv(dataset_path.joinpath('twenty_news_raw.csv'), sep='\t')
common_words = [
    # to be determined
]

twenty_news_processor = PreProcessor(twenty_news_raw_df,common_words,'twenty_news')

In [None]:
twenty_news_row_df,twenty_news_exploded_df = twenty_news_processor.process()

### Analysis of data
