# Quotes classification

In this notebook, we process each quote of the corpus and assign it a binary score indicating whether the quote is formal or colloquial, according to the dictionary of English contractions to avoid in Wikipedia articles.

## Packages and functions

In [None]:
#packages
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects

In [None]:
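#function to check if a string contains a dictionary word and update the counts in dico
def isinside1(test_string,test_list):
    global dico
    # dictionary entries found as substrings of the quote
    res = [ele for ele in test_list if(ele in test_string)]
    if res:
        # select rows by word rather than by position: test_list holds unique
        # words, so its positions need not match the index of dico
        dico.loc[dico["word"].isin(res), "occurences"] += 1
        return 1
    return 0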

def isinside2(test_string,test_list):
    if any(ext in test_string for ext in test_list):
        return 1
    return 0

## Dictionary loading
The dictionary is compiled in the notebook `contractions_dictionary.ipynb` and is based on the [Wikipedia English contractions list](https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions).<br>
A column named "occurences" is created to count how many times each word is detected.<br>
A list is built from the dictionary words for comparison.

In [None]:
dico= pd.read_pickle("./english_contractions.pkl")
dico["occurences"]=0
dicolist= dico["word"].unique().tolist()

## Quotebank sample loading

In [None]:
#df= pd.read_json("../../../Sample_classified_1Mio_v1.json.bz2",compression="bz2",lines=True)
df= pd.read_json("../../../polUS_quotes_speakers_merged.json.bz2",compression="bz2",lines=True)

### Quotes formatting for comparison : lowercase and space at the beginning and end
Note that tokenisation has not been used, as some of the words of the dictionary consist of several tokens (for example, "isn't" is composed of the tokens "is" and "n't").
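The padding and substring comparison described above can be sketched on toy data. This is a minimal illustration, not the notebook's actual dictionary: `sample_words`, `pad_and_lower` and `contains_any` are hypothetical names, and the space-delimited entries are an assumption about how the dictionary words are stored.

```python
# Hypothetical stand-in for the dictionary list; assumes entries are
# stored with a leading and trailing space.
sample_words = [" isn't ", " gonna ", " ain't "]

def pad_and_lower(quote):
    """Lowercase a quote and pad it with one space on each side."""
    return " " + quote.lower() + " "

def contains_any(padded_quote, words):
    """Return 1 if any dictionary entry is a substring of the padded quote."""
    return int(any(word in padded_quote for word in words))

first = contains_any(pad_and_lower("Isn't it obvious?"), sample_words)
second = contains_any(pad_and_lower("It is not obvious."), sample_words)
third = contains_any(pad_and_lower("This is gonna."), sample_words)
print(first, second, third)
```

Under this assumption, the padding lets a contraction at the very start of a quote match, while a word immediately followed by punctuation (the third quote) is not detected.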

In [None]:
df_tested_quotes= df["quotation"].progress_apply(lambda x : " "+x.lower()+" ")

## Classification using the full dictionary
The dataset quotes are classified a first time using the full dictionary

### Classifying

In [None]:
df["colloquial"]= df_tested_quotes.progress_apply(lambda x : isinside1(x,dicolist))
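The counting done by `isinside1` could also be written without a global variable. The sketch below is one possible alternative, assuming the quotes are already lowercased and padded; `classify_and_count` and the toy inputs are hypothetical and not part of the notebook.

```python
from collections import Counter

def classify_and_count(padded_quotes, words):
    """Label each quote (1 = contains a dictionary word) and count,
    for each word, the number of quotes in which it appears."""
    counts = Counter()
    labels = []
    for quote in padded_quotes:
        matched = [w for w in words if w in quote]
        counts.update(matched)  # each word counted at most once per quote
        labels.append(1 if matched else 0)
    return labels, counts

# Toy data, for illustration only
toy_quotes = [" i don't know ", " it is fine ", " we ain't gonna go "]
toy_words = [" don't ", " gonna ", " ain't "]
labels, counts = classify_and_count(toy_quotes, toy_words)
print(labels)
print(counts)
```

The returned counter could then be mapped onto the "occurences" column in a single step, instead of updating the dataframe once per quote.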

### Quotes statistics

In [None]:
df.describe()

About 38.3% of quotes are classified as colloquial (the mean of the `colloquial` column).
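Since the score is binary, the share of colloquial quotes is simply the mean of the column. A minimal sketch on a toy dataframe (`toy` is hypothetical and unrelated to the Quotebank data):

```python
import pandas as pd

# Toy binary labels: 2 colloquial quotes out of 5
toy = pd.DataFrame({"colloquial": [1, 0, 0, 1, 0]})

# The mean of a 0/1 column is the fraction of ones
share = toy["colloquial"].mean()
formatted = f"{share:.1%}"
print(formatted)  # 40.0%
```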

### Most common words

In [None]:
dico["occurence_fraction"]= dico["occurences"]/df["colloquial"].count()
dico.sort_values(by='occurence_fraction', ascending=False)[:25]

In [None]:
dico.sort_values(by='occurence_fraction', ascending=False).reset_index()["occurences"].plot(
    xlabel="word rank", ylabel= "# of word occurrences", title= "Word occurrences (log-log)",logy=True, logx=True)

Surprisingly, word usage does not seem to follow Zipf's law: the curve would be a straight line on this log-log plot if it did.
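One way to make the visual check more quantitative is to fit a line to log(count) against log(rank): an ideal Zipf distribution gives a slope close to -1 with small residuals. The sketch below uses synthetic counts; `loglog_slope` is a hypothetical helper, and a least-squares fit is only a rough diagnostic rather than a formal test.

```python
import numpy as np

def loglog_slope(counts):
    """Least-squares slope of log(count) versus log(rank).
    Zero counts are dropped because their logarithm is undefined."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    counts = counts[counts > 0]
    ranks = np.arange(1, len(counts) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return slope

# Synthetic counts that follow count = 1000 / rank exactly
toy_counts = [1000 / r for r in range(1, 51)]
slope = loglog_slope(toy_counts)
print(round(slope, 3))
```

On the real data, the same function could be applied to the "occurences" column to see how far the fitted slope and the curve depart from a straight line.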

## Classification using the reduced dictionary
The dataset quotes are classified again using a dictionary from which the most common words have been removed

### Removal of words that appear in more than a certain fraction of quotes, defined by the `tresh` variable

In [None]:
tresh= 0.02
dico2= dico[dico["occurence_fraction"]<tresh]
dicolist2= dico2["word"].unique().tolist()
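The effect of the threshold can be checked on a toy dictionary. The numbers below are made up for illustration; only the column names and the `tresh` variable mirror the notebook.

```python
import pandas as pd

# Toy dictionary with invented counts over 1000 hypothetical quotes
toy_dico = pd.DataFrame({
    "word": [" don't ", " gonna ", " ain't "],
    "occurences": [300, 15, 5],
})
n_quotes = 1000
toy_dico["occurence_fraction"] = toy_dico["occurences"] / n_quotes

# Keep only the words found in fewer than 2% of the quotes
tresh = 0.02
reduced = toy_dico[toy_dico["occurence_fraction"] < tresh]
reduced_words = reduced["word"].tolist()
print(reduced_words)
```

Here " don't " appears in 30% of the toy quotes and is dropped, while the two rarer words are kept.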

### Classifying

In [None]:
df["colloquial"]= df_tested_quotes.progress_apply(lambda x : isinside2(x,dicolist2))  # quotes are already lowercased

### Quotes statistics

In [None]:
df.describe()

With the reduced dictionary, 10.2% of quotes are classified as colloquial.

## Quotes classification examples

In [None]:
import random
df_formal= df[df["colloquial"]==0].reset_index()
df_colloquial= df[df["colloquial"]==1].reset_index()

print("5 formal quotes sample : ")
for i in random.sample(range(len(df_formal)), 5):
    print("\n")
    print(df_formal["quotation"].loc[i])

print("\n5 colloquial quotes sample : ")
for i in random.sample(range(len(df_colloquial)), 5):
    print("\n")
    print(df_colloquial["quotation"].loc[i])

## Larger samples classification

In [None]:
del df
for year in range(2019,2020):
    print("start")
    sample= pd.read_json("../../../large_sample/Sample_{}_wrangled.json.bz2".format(year)
                         ,compression="bz2",lines=True)
    print("opened sample {}".format(year))
    sample_lowercase_quotes= sample["quotation"].progress_apply(lambda x : " "+x.lower()+" ")
    print("lowercased sample")
    sample["colloquial"]= sample_lowercase_quotes.progress_apply(lambda x : isinside2(x,dicolist2))
    print("classified sample")
    sample.to_json("../../../large_sample/Sample_{}_classified.json.bz2".format(year)
                   ,compression='bz2',lines=True,orient="records")
    print("saved sample")
    del sample

In [None]:
df= pd.read_json("../../../large_sample/Sample_{}_classified.json.bz2".format(2019)
                         ,compression="bz2",lines=True)