# Imports and setup

In [1]:
# Import the required libraries
import pandas as pd
from datetime import datetime
import bz2
import json
from tqdm.notebook import tqdm

# Tutorial 

## Extracting the domain names

This example shows how to extract domain names from a sample using the *tld* library. To install it, run `pip install tld` (prefixed with `!` in a notebook cell).

The following function takes a URL as its argument and returns the domain name:

In [2]:
from tld import get_tld

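def get_domain(url):
    # Parse the URL into a result object that exposes its parts as attributes
    res = get_tld(url, as_object=True)
    # Note: .tld holds the top-level domain (such as 'com'); the result object
    # also exposes .domain and .fld if the registered name is needed
    return res.tld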


Next, we read the data. Each sample has a property 'urls' containing a list of links to the original articles in which the quotation appears. We extract the domain name of each link and save a new file containing the samples together with their extracted domains. The new file is saved in Colab's local storage; optionally, change the path_to_out variable to save it directly to the drive. To generate the new file, run this cell:
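
A minimal sketch of such a cell, under stated assumptions: the helper name `add_domains` and its `extract` parameter are illustrative and not part of the original tutorial, and the paths in the example call are placeholders. To keep the sketch free of third-party dependencies, the default extractor uses the standard library's `urlparse` and returns the host name; passing the `get_domain` function defined above would use the *tld* library instead.

```python
import bz2
import json
from urllib.parse import urlparse


def host_of(url):
    """Stand-in extractor (assumption): return the host part of a URL."""
    return urlparse(url).netloc


def add_domains(path_to_file, path_to_out, extract=host_of):
    """Copy a bz2-compressed JSON-lines file, adding a 'domains' list to each sample."""
    with bz2.open(path_to_file, 'rb') as s_file:
        with bz2.open(path_to_out, 'wb') as d_file:
            for line in s_file:
                instance = json.loads(line)  # loading a sample
                # one extracted domain per link in 'urls'
                instance['domains'] = [extract(url) for url in instance['urls']]
                d_file.write((json.dumps(instance) + '\n').encode('utf-8'))


# Example call (paths are assumptions; adjust them to your setup):
# add_domains('quotes-2020.json.bz2', 'quotes-2020-domains.json.bz2')
```

Writing to a separate output path keeps the original file untouched if the run is interrupted.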

This cell should take around 25 minutes to run. Once it is done, the file (*quotes-2020-domains.json.bz2*) appears in the file explorer on the left side.

You are all set, good luck! :)

# Milestone 1




## Project description

First, one could try to associate a certain type of quote with the day of the week (or even the date) of the article. We have no pre-established pattern in mind, but one could look for one. There may be a link between the nature of the quotation (political, good news or bad news, humorous, and so on) and the day of the article. Historical events on the publication date can also be taken into account. For example, there might be fewer joyful quotes on days when tragic events happened.

## Code

### Functions

In [3]:
def retrieve_day(Date):
  """ Retrieve the weekday name from a date with format 'YYYY-MM-DD hh:mm:ss', where Y is the year, M the month,
      D the day, h the hours, m the minutes and s the seconds.
      Needs the following import:
      from datetime import datetime
      """

  try:                        
    date = datetime.strptime(Date, '%Y-%m-%d %H:%M:%S') # Convert to a datetime object and check that the format is correct and the numbers are valid
  except ValueError:
    raise ValueError('The string \'' + Date + '\' does not match the format \'YYYY-MM-DD hh:mm:ss\'') from None # Customize the error message

  if (date.year not in [2015, 2016, 2017, 2018, 2019, 2020]): # Check that the year is in the correct interval and that we do not have wrong data
    raise ValueError(f'The year {date.year} that you provided is not between 2015 and 2020 (inclusive).')

  weekday = date.isoweekday() # Obtain the weekday from 1 (Monday) to 7 (Sunday)

  # Map the ISO weekday number to its English name
  # (a list lookup also works on Python < 3.10, where match/case is not available)
  day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
  return day_names[weekday - 1]

In [4]:
def create_frame(filename, N):
  """ Creates a DataFrame with the first N rows of the file with the given filename.
      It is useful to load small portions and test things. """
  list_of_dicts = []
  with bz2.open(filename, 'rb') as s_file:
    for i, instance in enumerate(s_file):
      if i >= N: # stop once N rows have been read
        break
      instance = json.loads(instance) # loading a sample
      list_of_dicts.append(instance)

  return pd.DataFrame(list_of_dicts)
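
As an alternative sketch, `itertools.islice` can stop the iteration after N lines without a manual counter. The helper name `read_first_records` is hypothetical; it returns a list of dictionaries that can be passed to `pd.DataFrame`.

```python
import bz2
import json
from itertools import islice


def read_first_records(filename, n):
    """Return the first n JSON records of a bz2-compressed JSON-lines file."""
    with bz2.open(filename, 'rb') as s_file:
        # islice stops after n lines, so the remaining lines are never parsed
        return [json.loads(line) for line in islice(s_file, n)]


# Example call (file name as used in this notebook):
# records = read_first_records('quotes-2020.json.bz2', 100)
```

If the file holds fewer than `n` lines, all of them are returned.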
        

### Main code

In [5]:
filename = 'quotes-2020.json.bz2' 
df = create_frame(filename, 100)

In [6]:
df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2020-01-28-000082,[ D ] espite the efforts of the partners to cr...,,[],2020-01-28 08:04:05,1,"[[None, 0.7272], [Prime Minister Netanyahu, 0....",[http://israelnationalnews.com/News/News.aspx/...,E
1,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,[Q367796],2020-01-16 12:00:13,1,"[[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy...",[http://thehill.com/opinion/international/4782...,E
2,2020-02-10-000142,... He (Madhav) also disclosed that the illega...,,[],2020-02-10 23:45:54,1,"[[None, 0.8926], [Prakash Rai, 0.1074]]",[https://indianexpress.com/article/business/ec...,E
3,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,2,"[[None, 0.581], [Andy Harris, 0.4191]]",[https://patriotpost.us/opinion/68622-trump-bu...,E
4,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,[Q20684375],2020-01-24 20:37:09,4,"[[Meghan King Edmonds, 0.5446], [None, 0.2705]...",[https://people.com/parents/meghan-king-edmond...,E


In [7]:
a = retrieve_day(df.loc[2, 'date'])
print(a)

Monday
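
For a whole column, the weekday can also be computed in one step with pandas. This is a sketch on a small hand-made frame whose three dates are taken from the sample rows above; unlike `retrieve_day`, it does not check that the year lies between 2015 and 2020.

```python
import pandas as pd

# Small stand-in frame; in the notebook this would be the `df` loaded above
sample = pd.DataFrame({'date': ['2020-01-28 08:04:05',
                                '2020-01-16 12:00:13',
                                '2020-02-10 23:45:54']})

# Parse the strings with the same format as retrieve_day, then take the English day name
parsed = pd.to_datetime(sample['date'], format='%Y-%m-%d %H:%M:%S')
sample['day'] = parsed.dt.day_name()
print(sample['day'].tolist())
```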


In [10]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/main/nltk_data...


True

In [11]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

In [17]:
analyzer.polarity_scores(df.quotation[0])

{'neg': 0.0, 'neu': 0.768, 'pos': 0.232, 'compound': 0.872}
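
The `compound` value is a normalized score between -1 (most negative) and +1 (most positive). A common convention, suggested by the VADER authors, treats scores of at least 0.05 as positive and at most -0.05 as negative. The helper below is a hypothetical sketch of that rule applied to the dictionary shown above; the thresholds are a convention, not something this notebook fixes.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER score dictionary to 'positive', 'negative' or 'neutral'."""
    compound = scores['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Scores copied from the output above
example = {'neg': 0.0, 'neu': 0.768, 'pos': 0.232, 'compound': 0.872}
print(label_sentiment(example))
```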

In [18]:
filename = 'quotes-2020_days.json.bz2' 
df = create_frame(filename, 100)

In [20]:
df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,day
0,2020-01-28-000082,[ D ] espite the efforts of the partners to cr...,,[],2020-01-28 08:04:05,1,"[[None, 0.7272], [Prime Minister Netanyahu, 0....",[http://israelnationalnews.com/News/News.aspx/...,E,Tuesday
1,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,[Q367796],2020-01-16 12:00:13,1,"[[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy...",[http://thehill.com/opinion/international/4782...,E,Thursday
2,2020-02-10-000142,... He (Madhav) also disclosed that the illega...,,[],2020-02-10 23:45:54,1,"[[None, 0.8926], [Prakash Rai, 0.1074]]",[https://indianexpress.com/article/business/ec...,E,Monday
3,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,2,"[[None, 0.581], [Andy Harris, 0.4191]]",[https://patriotpost.us/opinion/68622-trump-bu...,E,Saturday
4,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,[Q20684375],2020-01-24 20:37:09,4,"[[Meghan King Edmonds, 0.5446], [None, 0.2705]...",[https://people.com/parents/meghan-king-edmond...,E,Friday


In [None]:
# Add the weekday to every quote of the dataset and save the result as a new file ('quotes-{year}_days.json.bz2').
# We write a new dataset so that the original is not altered in case of problems.

years = [2015, 2016, 2017, 2018, 2019, 2020]

for year in tqdm(years):

  path_to_file = f'quotes-{year}.json.bz2'
  path_to_out = f'quotes-{year}_days.json.bz2' # must differ from path_to_file, otherwise the input is overwritten while being read

  with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'wb') as d_file:
      for instance in tqdm(s_file):
        instance = json.loads(instance) # loading a sample
        day = retrieve_day(instance['date'])
        instance['day'] = day
        d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file

In [21]:
years = [2015, 2016, 2017, 2018, 2019, 2020]

for year in tqdm(years):

  path_to_file = f'quotes-{year}_days.json.bz2' 
  path_to_out = f'quotes-{year}_days_sentiment.json.bz2'

  with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'wb') as d_file:
      for instance in tqdm(s_file):
        instance = json.loads(instance) # loading a sample
        sent = analyzer.polarity_scores(instance['quotation'])
        instance['sentiment'] = sent
        d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file
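
Once a file carries both `day` and `sentiment`, the question from the project description can be approached by averaging the compound score per weekday. A minimal sketch follows, assuming each record has the `day` and `sentiment` fields written by the loops above; the function name `mean_compound_by_day` is illustrative.

```python
import bz2
import json
from collections import defaultdict


def mean_compound_by_day(filename):
    """Average the VADER compound score of all quotes for each weekday."""
    totals = defaultdict(float)  # sum of compound scores per day
    counts = defaultdict(int)    # number of quotes per day
    with bz2.open(filename, 'rb') as s_file:
        for line in s_file:
            instance = json.loads(line)
            totals[instance['day']] += instance['sentiment']['compound']
            counts[instance['day']] += 1
    return {day: totals[day] / counts[day] for day in totals}


# Example call (file name as produced by the loop above):
# mean_compound_by_day('quotes-2020_days_sentiment.json.bz2')
```

The file is streamed line by line, so the full dataset never has to fit in memory.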
