# TO RUN: IMPORTS

In [None]:
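import bz2
import json
import math
from datetime import date, datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale = 1.3, rc = {'figure.figsize':(10,6)})
sns.set_palette('colorblind')
# Assumed data location, matching the 'data/' prefix used when loading the extended file
data_folder = 'data/'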

# 1- SELECT DATA OF INTEREST 
As the project focuses on the debate on the right to bear arms in the USA, the first task consists of selecting the data related to this topic. To do so, we define a lexical field for the topic and keep only the quotations that contain at least one of its words. <br> The lexical field is the following set of words: 'gun','firearm','mass shooting','2nd Amendment','murder','homicide','gun shot','armed robbery','rifles','Second Amendment','Columbine', 'gun control'. <br> <br>
The selected quotes are then stored as a dataframe in a new data file named `quotes-20__-extended.json.bz2`, with some new columns. The added columns contain information about the speakers (gender, nationality, occupations, age (computed from the date of birth), ethnic group, party and religion). This information is taken from the second dataset, `speaker_attributes.parquet`, itself built from Wikidata. Quotations that are not attributed to any speaker are discarded. <br> <br>
As this selection takes a lot of time, we decided to process only the quotations of 2017 for this part of the project. The corresponding file is about 5 GB, so it contains more than enough quotations to run the first statistical tests and check whether our project is feasible. <br>
This is also why we decided to save as much information as possible about the speakers. <br> <br>
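The selection is a case-insensitive substring test, so a short keyword such as 'gun' also matches unrelated words like 'begun'. Below is a minimal sketch of a stricter variant, assuming whole-word matching is acceptable for the project. The helper name `mentions_topic`, the optional plural `s` and the shortened keyword list are illustrative choices, not part of the pipeline used here.

```python
import re

# A subset of the lexical field, for illustration only.
lexical_field = ['gun', 'firearm', 'mass shooting', 'Second Amendment', 'gun control']

# One alternation anchored on word boundaries, with an optional plural "s".
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in lexical_field) + r')s?\b',
    flags=re.IGNORECASE,
)

def mentions_topic(quotation):
    """Return True if the quotation contains a keyword as a whole word (hypothetical helper)."""
    return pattern.search(quotation) is not None

print(mentions_topic('Gun control is back on the agenda.'))  # True
print(mentions_topic('The meeting had already begun.'))       # False
print('gun' in 'The meeting had already begun.'.lower())      # True: plain substring test
```

Whole-word matching trades recall for precision: it drops false positives such as 'begun', but it would also miss compounds like 'gunman' unless they are added to the list.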


In [None]:
lexical_field = ['gun','firearm','mass shooting','2nd Amendment','homicide','gun shot','armed robbery','rifles','Second Amendment','Columbine', 'gun control']

speakers = pd.read_parquet(data_folder + 'speaker_attributes.parquet')
label = pd.read_csv(data_folder + 'wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')

In [None]:
path_to_file = data_folder + 'quotes-2017.json.bz2' 
path_to_out = data_folder + 'quotes-2017-extended-new.json.bz2'

iter = 0
nb_occ = 0

with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'wb') as d_file:
        for instance in s_file:
            instance = json.loads(instance) # loading a sample

            iter += 1
            if instance['numOccurrences'] is not None:
                nb_occ += instance['numOccurrences']

            if (iter % 100000 == 0):
                print('number of quotations read: {}'.format(iter))

            # We keep only the quotations that contain words of the lexical field and that have a speaker
            if any(substring.lower() in instance['quotation'].lower() for substring in lexical_field) and instance['qids'] != []:
                speaker = speakers.loc[speakers['id'] == instance['qids'][0]].squeeze()

                #We add nationality
                if speaker.nationality is not None:
                    instance['nationality'] = []
                    for i in speaker['nationality']:
                        nat = label.loc[i]['Label']
                        instance['nationality'].append(nat)
                else:
                    instance['nationality'] = None
                    
                #We add the gender
                if speaker.gender is not None:
                    instance['gender'] = []
                    for i in speaker['gender']:
                        gend = label.loc[i]['Label']
                        instance['gender'].append(gend)
                else:
                    instance['gender'] = None

                #We add the occupations
                if speaker.occupation is not None:
                    instance['occupation'] = []
                    for i in speaker['occupation']:
                        occ = label.loc[i]['Label']
                        instance['occupation'].append(occ)
                else:
                    instance['occupation'] = None

                #We add the age, computed from the date of birth
                try:
                    born = datetime.strptime(speaker.date_of_birth[0][1:11], "%Y-%m-%d").date()
                    today = date.today()
                    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
                except Exception: # missing or malformed date of birth
                    age = None
                instance['age'] = age

                #We add the ethnic group
                if speaker.ethnic_group is not None:
                    instance['ethnic_group'] = []
                    for i in speaker['ethnic_group']:
                        ethnic = label.loc[i]['Label']
                        instance['ethnic_group'].append(ethnic)
                else:
                    instance['ethnic_group'] = None

                #We add the party
                if speaker.party is not None:
                    instance['party'] = []
                    for i in speaker['party']:
                        part = label.loc[i]['Label']
                        instance['party'].append(part)
                else:
                    instance['party'] = None

                #We add the religion
                if speaker.religion is not None:
                    instance['religion'] = []
                    for i in speaker['religion']:
                        relig = label.loc[i]['Label']
                        instance['religion'].append(relig)
                else:
                    instance['religion'] = None

                d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file

                # to test
                # if iter > 10000:
                #     break

print('iter = {i}'.format(i = iter))
print('nb_occ = {n}'.format(n = nb_occ))
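
The six attribute lookups above repeat the same QID-to-label pattern, and the age is computed relative to today's date rather than the date of the quote. Below is a minimal sketch of how both steps could be factored into helpers, using tiny made-up tables in place of `speaker_attributes.parquet` and the label file. The helper names `qids_to_labels` and `age_on`, the toy speaker `Q_example`, indexing speakers by `id`, and the timestamp format (inferred from the `[1:11]` slice) are all assumptions, not the code that produced the dataset.

```python
from datetime import date, datetime

import pandas as pd

# Toy stand-ins for the real tables; values are illustrative only.
label = pd.DataFrame(
    {'Label': ['United States of America', 'male', 'politician']},
    index=pd.Index(['Q30', 'Q6581097', 'Q82955'], name='QID'),
)
speakers = pd.DataFrame({
    'id': ['Q_example'],
    'nationality': [['Q30']],
    'gender': [['Q6581097']],
    'occupation': [['Q82955']],
    'religion': [None],
}).set_index('id')  # indexing by id avoids scanning the whole table for each quote

def qids_to_labels(qids, label):
    """Map a list of QIDs to their labels, or return None when the attribute is missing."""
    if qids is None:
        return None
    return [label.loc[qid, 'Label'] for qid in qids]

def age_on(date_of_birth, reference):
    """Age in full years on `reference`, from a '+YYYY-MM-DDT00:00:00Z' string (assumed format)."""
    # Character 0 is the sign; characters 1 to 10 hold YYYY-MM-DD.
    born = datetime.strptime(date_of_birth[1:11], '%Y-%m-%d').date()
    return reference.year - born.year - ((reference.month, reference.day) < (born.month, born.day))

speaker = speakers.loc['Q_example']
instance = {'quotation': 'example quotation', 'qids': ['Q_example']}
for attribute in ['nationality', 'gender', 'occupation', 'religion']:
    instance[attribute] = qids_to_labels(speaker[attribute], label)
instance['age'] = age_on('+1950-06-15T00:00:00Z', date(2017, 6, 14))

print(instance)
```

Passing the reference date explicitly would make the age reproducible across runs, for example by using the date of the quotation instead of the day the notebook is executed.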

# 2- FIRST STATISTICS
Here, we extract a few basic statistics to make sure we have the data needed to carry out the project. After selecting the quotations of interest (the ones related to arms in the USA), we want to make sure they are numerous enough to support an actual study on the right to bear arms in the USA. We also want to check that our research questions are reasonable and can be answered from our data. <br>
<br>
First, we get a sense of the number of quotations that actually speak about arms, and we compute their share of the total 2017 quotes dataset.

### Load the data
The new dataset `quotes-2017-extended.json.bz2` can be loaded from here.

In [1]:
gunquotes = pd.read_json('data/quotes-2017-extended.json.bz2', lines=True, compression='bz2')
gunquotes


In [None]:
nblines_gunquotes = gunquotes.shape[0]
nbtot_gunquotes = gunquotes['numOccurrences'].sum()
print(nblines_gunquotes)
print(nbtot_gunquotes)

In [None]:
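# Totals for the full 2017 file, assumed to be the counters of the extraction loop above
nblines_totquotes = iter
nbtot_totquotes = nb_occ
share_gunquotes = 100 * nblines_gunquotes / nblines_totquotes
sharetot_gunquotes = 100 * nbtot_gunquotes / nbtot_totquotes
print(share_gunquotes)
print(sharetot_gunquotes)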

**Analysis :** <br>
The new dataset (with only the selected quotes) contains 36462 distinct quotes, some of which are quoted in several articles. Counting these repetitions, a total of 212199 quotations related to our topic were found in the 2017 newspapers. <br>
This represents a share of     % of the total quotations of 2017. This share is very small, which is not surprising given the size of the original 2017 dataset, and 36462 distinct quotations is already a substantial amount of data for one year of our project.
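
Computing this share requires the size of the full 2017 file. Below is a minimal sketch of how both totals could be counted in one streaming pass, assuming the file is bz2-compressed JSON lines with a `numOccurrences` field. The function name `count_quotes` and the temporary demo file are hypothetical.

```python
import bz2
import json
import os
import tempfile

def count_quotes(path):
    """Return (number of distinct quotes, total occurrences) for a .json.bz2 file."""
    n_lines, n_occurrences = 0, 0
    with bz2.open(path, 'rb') as source:
        for line in source:
            quote = json.loads(line)
            n_lines += 1
            # A missing or null count contributes 0
            n_occurrences += quote.get('numOccurrences') or 0
    return n_lines, n_occurrences

# Demo on a tiny made-up file instead of the real 2017 dump.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.bz2')
with bz2.open(demo_path, 'wb') as demo:
    for occurrences in (1, 4, 2):
        record = {'quotation': 'x', 'numOccurrences': occurrences}
        demo.write((json.dumps(record) + '\n').encode('utf-8'))

nblines_totquotes, nbtot_totquotes = count_quotes(demo_path)
print(nblines_totquotes, nbtot_totquotes)  # 3 7
```

Reading line by line keeps memory use constant, which matters for a file of several gigabytes.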

### A first timeline for 2017

In [2]:
# Useful functions for the following plots

def show_values(axs, orient="v", space=.01):
    """Write the value of each bar next to it, for vertical ("v") or horizontal ("h") bars."""
    def _single(ax):
        if orient == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height() + (p.get_height()*0.01)
                value = '{:.0f}'.format(p.get_height())
                ax.text(_x, _y, value, ha="center") 
        elif orient == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height() - (p.get_height()*0.5)
                value = '{:.0f}'.format(p.get_width())
                ax.text(_x, _y, value, ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _single(ax)
    else:
        _single(axs)

#### 'Gun quotes' timeline per month

In [None]:
## Plot of the number of gun-related quotes as a function of the month

gunquotes['dateWithoutTime'] = gunquotes['date'].dt.normalize()

quotes_perMonth = gunquotes.groupby(gunquotes['dateWithoutTime'].dt.month).numOccurrences.sum()
print(quotes_perMonth)
print(quotes_perMonth.sum())

sns.set_color_codes("colorblind")
ax = sns.barplot(x=quotes_perMonth.index, y=quotes_perMonth.values)
ax.set_xlabel('Months')
ax.set_ylabel('Number of quotes')
ax.set_title('Timeline of the gun-related quotations during the year 2017')
# The twelve labels assume that every month of 2017 has at least one quote
ax.set_xticklabels(labels=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])
plt.xticks(rotation=60)
show_values(ax) # annotate the bars before the figure is rendered
plt.show()


**Analysis :** <br>
The barplot reveals an unexpectedly large number of quotations about guns in June and October. One can guess that this is due to events that occurred in these months. For example, the number of quotes in October can be explained by the Las Vegas shooting of the 1st of October. To verify this guess, we look further at the distribution of the quotes within these 2 months.
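
Below is a minimal sketch of how the busiest days could be located programmatically rather than by eye, assuming a frame with the same `date` and `numOccurrences` columns as `gunquotes`. The toy rows and the helper name `top_days` are made up for illustration and are not counts from the real dataset.

```python
import pandas as pd

def top_days(quotes, n=3):
    """Return the n calendar days with the most quote occurrences."""
    per_day = quotes.groupby(quotes['date'].dt.normalize())['numOccurrences'].sum()
    return per_day.nlargest(n)

# Toy data only.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-10-01 09:00', '2017-10-02 08:00',
                            '2017-10-02 17:30', '2017-10-03 12:00']),
    'numOccurrences': [2, 10, 5, 4],
})
result = top_days(toy, n=2)
print(result)
```

Applied to the full frame, the dates returned by such a helper can then be compared with known events.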

#### Zoom on the month of June

In [None]:
## Plot of the number of gun-related quotes as a function of the date: zoom on JUNE

june = gunquotes[(gunquotes['date'].dt.month == 6)]
# Sum per day of the month; days without quotes are kept with a count of 0
quotes_perDay_june = june.groupby(june['date'].dt.day).numOccurrences.sum().reindex(range(1, 31), fill_value=0)

ax = sns.barplot(x=quotes_perDay_june.index, y=quotes_perDay_june.values)
ax.set_xlabel('Days')
ax.set_ylabel('Number of quotes')
ax.set_title('Timeline of the gun-related quotations in June 2017')
plt.xticks(rotation=60)
show_values(ax)
plt.show()

## Possible explanations:
## https://www.nytimes.com/2017/06/26/us/politics/supreme-court-guns-public-california.html
## https://www.pewresearch.org/social-trends/2017/06/22/americas-complex-relationship-with-guns/
## Shooting at the Republican practice for the Congressional Baseball Game on the 14th of June 2017 : https://fr.wikipedia.org/wiki/Fusillade_de_l%27entrainement_r%C3%A9publicain_du_match_de_baseball_du_Congr%C3%A8s
## Pizzagate shooter sentenced to 4 years in prison on the 22nd of June 2017 : https://edition.cnn.com/2017/06/22/politics/pizzagate-sentencing/index.html

**Analysis :**  
TO COMPLETE 

#### Zoom on the month of October

In [3]:
## Plot of the number of gun-related quotes as a function of the date: zoom on OCTOBER

october = gunquotes[(gunquotes['date'].dt.month == 10)]
# Sum per day of the month; days without quotes are kept with a count of 0
quotes_perDay_october = october.groupby(october['date'].dt.day).numOccurrences.sum().reindex(range(1, 32), fill_value=0)

ax = sns.barplot(x=quotes_perDay_october.index, y=quotes_perDay_october.values)
ax.set_xlabel('Days')
ax.set_ylabel('Number of quotes')
ax.set_title('Timeline of the gun-related quotations in October 2017')
plt.xticks(rotation=60)
show_values(ax)
plt.show()

## Explanation : Las Vegas shooting on the 1st of October 2017


**Analysis :**  
Most of the quotations of the month of October were found in articles of the 2nd of October, the day after the Las Vegas mass shooting. This confirms our guess that these articles discussed this event. The following days also show a higher number of quotations, revealing that the media kept talking about the tragedy.

# 3- MATCHING OF THE DATASET FOR SENTIMENT ANALYSIS

### Matching of the dataset 

### Sentiment analysis

# Conclusions on the next steps of the project  
TO COMPLETE