TODO: Small description for notebook

## <span style="color:black">Mount drive to notebook</span>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## <span style="color:black">Install needed packages</span>

In [None]:
!pip install tld
!pip install pandas==1.0.5

import os
import bz2
import json
import glob
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#Packages for url parsing
import tld
from tld import get_tld
from urllib.request import urlopen
from bs4 import BeautifulSoup

#Packages for NLP methods
import re
import nltk
import gensim
from gensim import models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
# We need this dataset in order to use the tokenizer
nltk.download('punkt')
# Also download the list of stopwords to filter out
nltk.download('stopwords')
stemmer = PorterStemmer()

# Add constants/paths
_DATASETS_PATHS = '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/Quotebank'

## <span style="color:Blue">1) Load Datasets</span>

In [4]:
# Save the datasets paths
quote_datasets = sorted(glob.glob(_DATASETS_PATHS+'/*.bz2'))
quote_datasets

['/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/Quotebank/quotes-2015.json.bz2',
 '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/Quotebank/quotes-2016.json.bz2',
 '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/Quotebank/quotes-2017.json.bz2',
 '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/Quotebank/quotes-2018.json.bz2',
 '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/Quotebank/quotes-2019.json.bz2',
 '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/Quotebank/quotes-2020.json.bz2']

### <span style="color:red"><div style="text-align: justify">To handle this extremely large dataset, we read the data in chunks, apply all processing and data-wrangling steps to each chunk, and then combine the per-chunk results (inspired by this [article](https://towardsdatascience.com/3-simple-ways-to-handle-large-data-with-pandas-d9164a3c02c1)).</div></span>

In [8]:
#Iterate through the yearly datasets, reading each one through an iterator that yields fixed-size chunks (the size is set by chunksize)
for dataset in quote_datasets:
    df_reader = pd.read_json(dataset, lines=True, compression='bz2', chunksize=500000)
    print('Processing Data:', os.path.basename(dataset))
    for chunk in df_reader:
        chunk = process_chunk(chunk)

Processing chunk with 1000000 rows
Index(['quoteID', 'quotation', 'speaker', 'qids', 'date', 'numOccurrences',
       'probas', 'urls', 'phase'],
      dtype='object')
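
`process_chunk` is called in the loop above but is not defined in this notebook. A minimal sketch of what such a helper might look like, assuming it optionally applies an extraction function to a chunk and appends the result to a CSV file. The parameter names `extract_fn` and `out_path` are hypothetical, chosen only to match how the function is called in this notebook:

```python
import os
import pandas as pd

def process_chunk(chunk, extract_fn=None, out_path=None):
    """Hypothetical helper: optionally filter a chunk, then append it to a CSV file."""
    print(f'Processing chunk with {len(chunk)} rows')
    # Apply the extraction step (for example a keyword filter) if one is given
    if extract_fn is not None:
        chunk = extract_fn(chunk)
    # Append the (filtered) chunk to disk so results accumulate across chunks
    if out_path is not None and len(chunk) > 0:
        # Write the header only once, when the file is first created
        write_header = not os.path.exists(out_path)
        chunk.to_csv(out_path, mode='a', header=write_header, index=False)
    return chunk
```

Writing the header only for the first chunk keeps repeated header rows out of the combined file, so they do not have to be filtered out after loading.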


## <span style="color:Blue">2) Topic Analysis</span>

### <span style="color:Red"><div style="text-align: justify">After loading the data, we need to extract the quotes that are relevant to our project. We consider three methods for doing so, each explained below.</div></span>

### <span style="color:green">A- Manual Extraction Method</span>

#### <span style="color:purple"><div style="text-align: justify">This is a naive method: we keep the rows whose quotes contain specific keywords. To compile a more thorough keyword list, we reviewed articles and papers on sexual harassment movements such as MeToo, for example [article 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6751092/).</div></span>

#### <span style="color:purple"><div style="text-align: justify">We then filtered the 2015-2020 dataset with a single keyword (metoo) and built a word cloud from the resulting quotes using the [WordCloud library](https://amueller.github.io/word_cloud/) to identify the top 20 keywords mentioned in these quotes and pick the relevant ones. We also added keywords from this [website](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2SRSKJ) and this [article](https://journals.sagepub.com/doi/10.1177/1940161220968081), ending up with 15 keywords.</div></span>

In [None]:
me_too_quotes_path = '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/metoo-extracted-quotes.csv.bz2'

In [None]:
#Iterate through all datasets and extract the quotes containing metoo
for dataset in quote_datasets:
    df_reader = pd.read_json(dataset, lines=True, compression='bz2', chunksize=500000)
    print('Processing Data:', os.path.basename(dataset))
    for chunk in df_reader:
        chunk = process_chunk(chunk, utils.manual_extraction, me_too_quotes_path)

In [None]:
#Load the saved results from previous cell
df = pd.read_csv(me_too_quotes_path)
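#Drop stray header rows (rows whose 'quotation' value is the literal column name), which can appear when chunks are appended together with their headers
df = df[df['quotation'] != 'quotation']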
df = df.drop(columns='Unnamed: 0')
print(f'The number of quotes that contain metoo: {len(df)}')

In [None]:
from wordcloud import WordCloud

#Create a word cloud from the extracted quotes
#Remove the search keyword itself so that it does not dominate the cloud
quotes = df.quotation.str.replace('metoo', ' ')
#Process texts in quotes
cleaned_text = quotes.apply(utils.process_text)
#Join the processed tokens of all quotes into one string
long_string = ','.join([text for text_list in cleaned_text.values for text in text_list])
#Create a WordCloud object
wordcloud = WordCloud(width=800, height=800, background_color="white", max_words=1000, contour_width=3, contour_color='steelblue')
#Generate a word cloud
wordcloud.generate(long_string)
#Visualize the word cloud
wordcloud.to_image()
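
The word cloud gives a visual impression of the most frequent words. To read off the top 20 keywords explicitly, the processed tokens can also be counted. A minimal sketch, assuming each processed quote is a list of tokens (the helper name `top_keywords` and the sample data are hypothetical):

```python
from collections import Counter

def top_keywords(token_lists, n=20):
    """Return the n most frequent tokens across all token lists."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(n)

# Toy token lists standing in for the processed quotes
sample = [['women', 'movement'], ['women', 'campaign'], ['women', 'movement']]
print(top_keywords(sample, n=2))  # [('women', 3), ('movement', 2)]
```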

In [None]:
keywords = ['movement', 'women', 'victim', 'campaign', 'sex', 'harass', 'assault',
           'rape', 'misconduct', 'metoo', 'timesup', 'abuse', 'workplace', 'right', 'femin']
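
A keyword list like this one can be applied to a chunk with a case-insensitive substring match on the `quotation` column. The sketch below is one possible way to do it, not necessarily how `utils.manual_extraction` works; the function name `filter_by_keywords` and the sample quotes are hypothetical:

```python
import re
import pandas as pd

def filter_by_keywords(df, keywords, column='quotation'):
    """Keep the rows whose text contains at least one keyword (case-insensitive)."""
    # Escape each keyword so it is matched literally, then OR them together
    pattern = '|'.join(re.escape(k) for k in keywords)
    mask = df[column].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask]

# Toy quotes standing in for a chunk of the dataset
sample = pd.DataFrame({'quotation': [
    'The MeToo movement changed the workplace.',
    'The weather is nice today.',
    'Harassment must be reported.',
]})
matches = filter_by_keywords(sample, ['metoo', 'harass'])
print(matches)
```

Because this is plain substring matching, stems such as `harass` also catch `harassment`, but a short keyword such as `right` also matches unrelated words like `bright`.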

Add Plots and mention limitations for this method

---

### <span style="color:green">B- Parsing URLs Method</span>

In [33]:
url = 'http://www.santacruzsentinel.com/environment-and-nature/20180630/dan-haifley-our-ocean-backyard-beach-cleanups-fight-plastic-pollution'
parse_title(url)

'environment-and-nature'
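
`parse_title` is not defined in this notebook. Judging only from the output above, it appears to return the first path segment of the URL, which news sites often use as a section or topic name. A minimal sketch under that assumption, using only the standard library:

```python
from urllib.parse import urlparse

def parse_title(url):
    """Hypothetical reconstruction: return the first non-empty path segment of a URL."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[0] if segments else ''

url = 'http://www.santacruzsentinel.com/environment-and-nature/20180630/dan-haifley-our-ocean-backyard-beach-cleanups-fight-plastic-pollution'
print(parse_title(url))  # environment-and-nature
```

URLs without a path, such as a bare domain, yield an empty string in this sketch.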

Add Plots and mention limitations for this method

---

### <span style="color:green">C- NLP-based Method</span>

In [None]:
topics = model.print_topics(num_words=3)
for topic in topics:
    print(topic)

---

## <span style="color:Blue">3) Dataset Augmentation</span>

### <span style="color:Red"><div style="text-align: justify">Here, we enrich our dataset with external ones: we explore each external dataset and run a preliminary analysis on it.</div></span>

### <span style="color:green">A- Twitter Dataset</span>

#### <span style="color:Purple"><div style="text-align: justify">The tweets dataset comes from [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2SRSKJ) and contains 32,071,469 #metoo tweets ranging from October 15, 2017 to March 31, 2020. Because the dataset only contains tweet IDs, we applied for access to the Twitter API in order to fetch the tweets and their metadata from those IDs.</div></span>

In [None]:
import tweepy

#Twitter Developer keys go here
#The real values are CENSORED; replace the placeholders with your own credentials
consumer_key = '<CONSUMER_KEY>'
consumer_key_secret = '<CONSUMER_KEY_SECRET>'
access_token = '<ACCESS_TOKEN>'
access_token_secret = '<ACCESS_TOKEN_SECRET>'

#Get access to the Twitter API to fetch tweets
auth = tweepy.OAuthHandler(consumer_key, consumer_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

In [None]:
#Fetching the needed info from the api (Tweet, Location and Date)
tweets_data = utils.ids_to_tweets('/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/me_too_tweets/metoo_project_full_dataset_01.txt', api)
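
`utils.ids_to_tweets` is not shown in this notebook. Since Twitter's status lookup endpoint accepts at most 100 IDs per request, a helper of this kind has to send the IDs in batches. A minimal sketch of that batching logic, assuming one tweet ID per line in the input file. Here the API call is passed in as a hypothetical `lookup_fn` callable, which could wrap tweepy's batch status lookup method (its name differs between tweepy versions):

```python
def batched(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def ids_to_tweets(ids_path, lookup_fn, batch_size=100):
    """Hypothetical sketch: read tweet IDs from a file and fetch them batch by batch."""
    with open(ids_path) as f:
        ids = [line.strip() for line in f if line.strip()]
    tweets = []
    for batch in batched(ids, batch_size):
        # lookup_fn takes a list of IDs and returns the corresponding tweets
        tweets.extend(lookup_fn(batch))
    return tweets
```

Passing the lookup as a callable keeps the batching logic testable without network access or credentials.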

### <span style="color:green">B- Traumatic Events Dataset</span>

#### <span style="color:magenta">1- Query WikiData</span>

#### <span style="color:magenta">2- Manual Search</span>

### <span style="color:green">C- Non-Traumatic Events Dataset</span>

#### <span style="color:magenta">1- Movie Releases</span>

#### <span style="color:magenta">2- Events/Speeches</span>

## <span style="color:Blue">4) Correlation/Significance Analysis</span>

In [None]:
with bz2.open(quote_datasets[5], 'rb') as s_file:
    for instance in s_file:
        instance = json.loads(instance) # loading a sample
        urls = instance['urls'] # extracting list of links
        # print(urls)
        domains = []
        for url in urls:
            # parse_title returns the first path segment of the URL, as in section 2B
            # (variable renamed from `tld` to avoid shadowing the imported tld module)
            section = parse_title(url)
            domains.append(section)
            print(section)
  
# # displaying the title
# print("Title of the website is : ")
# print (soup.title.get_text())