# Quotes classification

In this notebook, each quote in the corpus is assigned a binary score indicating whether it is formal or colloquial, according to the dictionary of English contractions to avoid in Wikipedia articles.

## Packages and functions

In [1]:
#packages
import pandas as pd
from tqdm import tqdm, notebook
notebook.tqdm().pandas()

0it [00:00, ?it/s]

In [2]:
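# Return 1 if any dictionary entry is contained in the string, and update the counts in dico
def isinside1(test_string, test_list):
    global dico
    # Dictionary entries found as substrings of the quote
    res = [ele for ele in test_list if ele in test_string]
    if res:
        for ele in res:
            # Position in the list, assumed to match the row label in dico
            i = test_list.index(ele)
            # Single .loc call: chained indexing may write to a copy
            dico.loc[i, "occurences"] += 1
        return 1
    return 0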

def isinside2(test_string,test_list):
    if any(ext in test_string for ext in test_list):
        return 1
    return 0
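
As a side note, the counting done by `isinside1` can be sketched without mutating a global DataFrame. The helper `count_matches` and the toy word list below are hypothetical and not part of the notebook; the idea is to tally per-word hits in a `collections.Counter`, which could then be turned into a column afterwards.

```python
from collections import Counter

def count_matches(test_string, test_list, counter):
    """Return 1 if any entry of test_list occurs in test_string, else 0.

    Every matching entry is also tallied in `counter` (hypothetical helper).
    """
    matches = [ele for ele in test_list if ele in test_string]
    counter.update(matches)
    return 1 if matches else 0

# Toy data standing in for the contraction dictionary and the quotes
words = ["don't", "it's", "gonna"]
quotes = [" it's fine, don't worry ", " we are ready ", " it's late "]

counts = Counter()
labels = [count_matches(q, words, counts) for q in quotes]

print(labels)        # [1, 0, 1]
print(dict(counts))  # {"don't": 1, "it's": 2}
```

Because the counter is keyed by the word itself, this variant does not rely on list positions matching DataFrame row labels.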

## Dictionary loading
The dictionary is compiled in the notebook `contractions_dictionary.ipynb` and is based on the [Wikipedia English contractions list](https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions).<br>
A column named "occurences" is created to count how many times each word is detected.<br>
A list of the dictionary words is built for the comparison.

In [3]:
dico= pd.read_pickle("./english_contractions.pkl")
dico["occurences"]=0
dicolist= dico["word"].unique().tolist()

## Quotebank sample loading

In [4]:
df= pd.read_json("../../../Sample_classified_1Mio_v1.json.bz2",compression="bz2",lines=True)

### Quotes formatting for comparison: lowercase, with a space at the beginning and end
Note that tokenisation is not used, because some of the dictionary words consist of several tokens (for example, "isn't" is composed of the tokens "is" and "n't").
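
To illustrate why the quotes are padded rather than tokenised, here is a minimal sketch. It assumes that some dictionary entries are themselves padded with spaces (the entry `" gonna "` is a hypothetical example, not taken from the dictionary), so that a plain substring test matches whole words, including at the start or end of a quote.

```python
def format_quote(quote):
    # Same formatting as the notebook: lowercase, one space added on each side
    return " " + quote.lower() + " "

entry = " gonna "  # hypothetical space-padded dictionary entry

# Without padding, a quote that ends with the word is missed
print(entry in "we're gonna".lower())        # False
# With padding, the word is found at the quote boundary
print(entry in format_quote("We're gonna"))  # True
# A multi-token contraction is still matched as a single substring
print("isn't" in format_quote("This ISN'T over"))  # True
```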

In [32]:
df_tested_quotes= df["quotation"].progress_apply(lambda x : " "+x.lower()+" ")

  0%|          | 0/668534 [00:00<?, ?it/s]

## Classification using the full dictionary
The quotes are first classified using the full dictionary.

### Classifying

In [35]:
df["colloquial"]= df_tested_quotes.progress_apply(lambda x : isinside1(x,dicolist))

  0%|          | 0/668534 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


### Quotes statistics

In [36]:
df.describe()

| | numOccurrences | p1 | p2 | delta_p | year | colloquial |
|---|---|---|---|---|---|---|
| count | 668534.0 | 668534.0 | 668534.0 | 668534.0 | 668534.0 | 668534.0 |
| mean | 3.55898 | 0.818245 | 0.161096 | 0.657149 | 2017.536327 | 0.383233 |
| std | 22.64603 | 0.095738 | 0.081709 | 0.173361 | 1.770882 | 0.486175 |
| min | 1.0 | 0.5001 | 0.0086 | 0.3 | 2015.0 | 0.0 |
| 25% | 1.0 | 0.7474 | 0.0934 | 0.5219 | 2016.0 | 0.0 |
| 50% | 1.0 | 0.8301 | 0.1526 | 0.6749 | 2018.0 | 0.0 |
| 75% | 2.0 | 0.8973 | 0.2219 | 0.8022 | 2019.0 | 1.0 |
| max | 12086.0 | 0.9908 | 0.35 | 0.9821 | 2020.0 | 1.0 |


About 38.3% of quotes are classified as colloquial (the mean of the binary `colloquial` column is 0.383233).

### Most common words

In [37]:
dico["occurence_fraction"]= dico["occurences"]/df["colloquial"].count()
dico.sort_values(by='occurence_fraction', ascending=False)[:25]

| | word | occurences | occurence_fraction |
|---|---|---|---|
| 63 | it's | 117142 | 0.175222 |
| 19 | don't | 64493 | 0.096469 |
| 54 | i'm | 55166 | 0.082518 |
| 102 | that's | 52490 | 0.078515 |
| 129 | we're | 46661 | 0.069796 |
| 42 | he's | 30909 | 0.046234 |
| 107 | there's | 24987 | 0.037376 |
| 130 | we've | 24858 | 0.037183 |
| 17 | didn't | 24844 | 0.037162 |
| 59 | i've | 23632 | 0.035349 |


## Classification using the reduced dictionary
The quotes are classified again using a dictionary from which the most common words have been removed.

### Removal of words that appear in more than a given fraction of quotes, set by the `tresh` variable

In [38]:
tresh= 0.02
dico2= dico[dico["occurence_fraction"]<tresh]
dicolist2= dico2["word"].unique().tolist()
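
As a worked check of this filter, the sketch below rebuilds the first three rows of the "most common words" output in a toy DataFrame (called `top` here, a hypothetical name) and applies the same comparison. The quote count is the `count` value reported by `df.describe()`.

```python
import pandas as pd

n_quotes = 668534  # number of quotes, from df.describe()
tresh = 0.02

# First three rows of the "most common words" output
top = pd.DataFrame({"word": ["it's", "don't", "i'm"],
                    "occurences": [117142, 64493, 55166]})
top["occurence_fraction"] = top["occurences"] / n_quotes

# Same comparison as the notebook: keep only words below the threshold
kept = top[top["occurence_fraction"] < tresh]
print(top["occurence_fraction"].round(6).tolist())
print(len(kept))  # 0
```

All ten words listed in that output have a fraction above 0.02, so none of them remain in `dicolist2`. A quote whose only contraction is "it's" is therefore no longer flagged by the reduced dictionary.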

### Classifying

In [39]:
df["colloquial"]= df_tested_quotes.progress_apply(lambda x : isinside2(x.lower(),dicolist2))

  0%|          | 0/668534 [00:00<?, ?it/s]
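
For a sample of this size, the same yes/no test can also be sketched with pandas' vectorised string matching instead of one Python call per quote. This is an alternative, not the notebook's method, and whether it is actually faster would need to be measured; `words` and `quotes` below are toy stand-ins for `dicolist2` and `df_tested_quotes`.

```python
import re
import pandas as pd

# Toy stand-ins for the reduced word list and the formatted quotes
words = ["ain't", "gonna", "y'all"]
quotes = pd.Series([" we ain't done ", " this is formal ", " y'all ready? "])

# Escape each entry so it is matched literally, then join into one alternation
pattern = "|".join(re.escape(w) for w in words)

# True where at least one entry occurs in the quote, converted to 0/1
colloquial = quotes.str.contains(pattern, regex=True).astype(int)
print(colloquial.tolist())  # [1, 0, 1]
```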

### Quotes statistics

In [40]:
df.describe()

| | numOccurrences | p1 | p2 | delta_p | year | colloquial |
|---|---|---|---|---|---|---|
| count | 668534.0 | 668534.0 | 668534.0 | 668534.0 | 668534.0 | 668534.0 |
| mean | 3.55898 | 0.818245 | 0.161096 | 0.657149 | 2017.536327 | 0.102488 |
| std | 22.64603 | 0.095738 | 0.081709 | 0.173361 | 1.770882 | 0.30329 |
| min | 1.0 | 0.5001 | 0.0086 | 0.3 | 2015.0 | 0.0 |
| 25% | 1.0 | 0.7474 | 0.0934 | 0.5219 | 2016.0 | 0.0 |
| 50% | 1.0 | 0.8301 | 0.1526 | 0.6749 | 2018.0 | 0.0 |
| 75% | 2.0 | 0.8973 | 0.2219 | 0.8022 | 2019.0 | 0.0 |
| max | 12086.0 | 0.9908 | 0.35 | 0.9821 | 2020.0 | 1.0 |


With the reduced dictionary, about 10.2% of quotes are classified as colloquial.

## Quotes classification examples

In [41]:
import random
df_formal= df[df["colloquial"]==0].reset_index()
df_colloquial= df[df["colloquial"]==1].reset_index()

print("5 formal quotes sample : ")
for i in random.sample(range(len(df_formal)), 5):
    print("\n")
    print(df_formal["quotation"].loc[i])

print("\n5 colloquial quotes sample : ")
for i in random.sample(range(len(df_colloquial)), 5):
    print("\n")
    print(df_colloquial["quotation"].loc[i])

5 formal quotes sample : 


I can only talk from last year, but I think we had the feeling that we were like a top-two team and we maybe had the pressure in the semifinal. I think this group, and this year, is different. I think we can strike from behind and focus on being on the semifinal.


For the most part, everything else at United is going really well. Operationally, United is running the best airline... that we ever run,


It's temporary. Water is yet to recede in many places. As a result, less fish is being netted. Moreover, the ban on catching hilsa fish is another reason for the crisis.


Do you have what it takes to finish?


She is described as waifish -- that is, she was shorter than most models in the 90s -- but in reality, it's a small difference.

5 colloquial quotes sample : 


And there's a big dish in the middle that my husband's mother couldn't bear to throw away. And that ended up in the window.


You've got to be a winner and have all the right trajectory as a play