# Master Project

This notebook contains all the code used for my master's project.

# Data collection

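I collected the tweets with the `twitterscraper` package, searching for "volkswagen" between 2015-09-01 (`-bd`, begin date) and 2015-10-15 (`-ed`, end date) and writing the results to a JSON file (`-o`):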

```
twitterscraper volkswagen -bd 2015-09-01 -ed 2015-10-15 -o 01_09_2015-15_10_2015.json
```

In [1]:
import re
import json

import numpy as np
import pandas as pd
from tqdm import tnrange

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

from nltk.tokenize import WordPunctTokenizer

# Data cleaning

Let's open the raw data. The first run parses the JSON file and caches it as `data.npy`; later runs load the cache instead.

In [2]:
%%time

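labels = ['fullname', 'html', 'id', 'likes', 'replies', 'retweets', 'text', 'timestamp', 'url', 'user']

try:
    # Fast path: load the array cached by a previous run. It holds Python
    # objects (strings), so NumPy must be allowed to unpickle it.
    arr = np.load('data.npy', allow_pickle=True)
    df = pd.DataFrame(arr, columns=labels)
except FileNotFoundError:
    # First run: parse the scraper output, fix the column order, cache it.
    with open('01_09_2015-15_10_2015.json', 'r') as file:
        raw = json.load(file)
    df = pd.DataFrame(raw, columns=labels)
    np.save('data.npy', df.values)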

CPU times: user 8.07 s, sys: 12.5 s, total: 20.6 s
Wall time: 27 s
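
The load-or-build pattern used in the cell above can be isolated into a small helper. The sketch below is only an illustration built on the standard library, not the code used in this project: `load_or_build`, `read_json`, their parameters and the pickle cache format are all assumptions.

```python
import json
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if it exists, else build it and cache it.

    Hypothetical helper: `cache_path` and `build` are illustrative names.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse the result of a previous run.
        with cache_path.open('rb') as fh:
            return pickle.load(fh)
    # Slow path: compute the result once, then store it for next time.
    result = build()
    with cache_path.open('wb') as fh:
        pickle.dump(result, fh)
    return result


def read_json(path):
    """Parse a JSON file such as the one written by the scraper."""
    with open(path, 'r', encoding='utf-8') as fh:
        return json.load(fh)
```

A call could look like `load_or_build('raw.pkl', lambda: read_json('01_09_2015-15_10_2015.json'))`, where `raw.pkl` is a made-up cache name. Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.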


We have collected **914 274** tweets.

In [3]:
print('Number of tweets collected: {0}'.format(len(df)))

Number of tweets collected: 914274


A tweet looks like this.

In [4]:
df.iloc[1]

fullname                                       Rhondas Romance
html         <p class="TweetTextSize js-tweet-text tweet-te...
id                                          640675779253280768
likes                                                        0
replies                                                      0
retweets                                                     0
text         I liked a @YouTube video http://youtu.be/pmzZb...
timestamp                                  2015-09-06T23:59:43
url                 /RhosBookReviews/status/640675779253280768
user                                           RhosBookReviews
Name: 1, dtype: object

For the analysis, we keep only the `id` and `text` columns. Later on, we might use likes, replies and retweets to weight tweets by importance.

In [5]:
df.drop(columns=['fullname', 'html', 'likes', 'replies', 'retweets', 'timestamp', 'url', 'user'], inplace=True)
df.head()

| | id | text |
|---|---|---|
| 0 | 640675797720801280 | Out here with my twin! @big_euro #socaleuro #V... |
| 1 | 640675779253280768 | I liked a @YouTube video http://youtu.be/pmzZb... |
| 2 | 640675589251330048 | Triste que ahora estacionen un Volkswagen que ... |
| 3 | 640675540794515456 | I liked a @YouTube video http://youtu.be/D-He3... |
| 4 | 640675381922656256 | I liked a @YouTube video http://youtu.be/8ZhGp... |


We remove the links, both `http(s)://` URLs and `pic.twitter.com` media links.

In [6]:
print('Original:')
print(df['text'][1])
print(df['text'][18])
print('Transformed:')
print(re.sub(r'(https?:\/\/|pic\.twitter)[^\s]+', '', df['text'][1]))
print(re.sub(r'(https?:\/\/|pic\.twitter)[^\s]+', '', df['text'][18]))

Original:
I liked a @YouTube video http://youtu.be/pmzZbUioFAQ?a  2015 Volkswagen Sales Event | “Model Rear End” Passat Commercial
Volkswagen pic.twitter.com/hPnvGJ3Ysi
Transformed:
I liked a @YouTube video   2015 Volkswagen Sales Event | “Model Rear End” Passat Commercial
Volkswagen 
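
The substitution above can be wrapped in a small function and checked on a few strings. This is a minimal sketch, assuming the standard `re` module; `LINK_PATTERN` and `remove_links` are illustrative names. The dot in `pic\.twitter` is escaped here so that it matches only a literal dot, because an unescaped `.` matches any character.

```python
import re

# Matches http(s) links and pic.twitter.com media links, up to the next
# whitespace. The escaped dot matches only a literal '.'.
LINK_PATTERN = re.compile(r'(?:https?://|pic\.twitter)\S+')


def remove_links(text):
    """Delete every link from `text`, leaving the surrounding spaces as-is."""
    return LINK_PATTERN.sub('', text)


print(repr(remove_links('Volkswagen pic.twitter.com/hPnvGJ3Ysi')))
print(repr(remove_links('epic_twitter_fail')))
```

The second string is left untouched; with an unescaped dot, `pic_twitter_fail` would be treated as a link and removed.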


We strip the `@` and `#` symbols from user tags and hashtags but keep the text that follows them.

In [7]:
print('Original:')
print(df['text'][0])
print('Transformed:')
print(re.sub(r'(#|@)', '', df['text'][0]))

Original:
Out here with my twin! @big_euro #socaleuro #Volkswagen #b7passat #passat #static #bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv
Transformed:
Out here with my twin! big_euro socaleuro Volkswagen b7passat passat static bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv


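Let's write a function, `cleaner`, that applies all the cleaning steps: it removes links, tag symbols and digits, lowercases the text, replaces punctuation with spaces and collapses runs of whitespace.

We also write a function, `english_keeper`, that keeps only the English tweets. Tweets whose language `langdetect` cannot determine are dropped as well.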

In [8]:
def cleaner(df):
    # Work on a copy so the input DataFrame is left untouched
    df_tmp = df.copy()

    # Links cleaning (regex=True is required by recent pandas versions)
    df_tmp['text'] = df_tmp['text'].str.replace(r'(https?:\/\/|pic\.twitter)[^\s]+', '', regex=True)

    # @ and # cleaning
    df_tmp['text'] = df_tmp['text'].str.replace(r'(#|@)', '', regex=True)

    # Remove the numbers
    df_tmp['text'] = df_tmp['text'].str.replace(r'\d+', '', regex=True)

    # Put everything to lower case
    df_tmp['text'] = df_tmp['text'].str.lower()

    # Replace punctuation with spaces
    df_tmp['text'] = df_tmp['text'].str.replace(r'[^\w\s]', ' ', regex=True)

    # Collapse runs of whitespace into a single space
    df_tmp['text'] = df_tmp['text'].str.replace(r'\s{2,}', ' ', regex=True)

    return df_tmp

def english_keeper(df):    
    english_tweets = []
    
    for i in tnrange(len(df)):
        text = df['text'].iloc[i]
        try:
            if detect(text) == 'en':
                english_tweets.append((df['id'].iloc[i], text))
        except LangDetectException:
            pass
    
    df_english = pd.DataFrame.from_records(english_tweets, columns=['id', 'text'])
    
    return df_english
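
To see what these steps do to a single tweet, the sketch below replays them on plain strings with the standard `re` module. It is an illustration, not the code run in this notebook: `clean_text` and `keep_language` are hypothetical names, and the language detector is passed in as a parameter (`detect_fn`) so the logic can be tried without `langdetect` installed.

```python
import re

LINKS = re.compile(r'(?:https?://|pic\.twitter)\S+')
TAG_SYMBOLS = re.compile(r'[#@]')
DIGITS = re.compile(r'\d+')
PUNCTUATION = re.compile(r'[^\w\s]')
WHITESPACE_RUNS = re.compile(r'\s{2,}')


def clean_text(text):
    """Apply the same steps as `cleaner`, in the same order, to one string."""
    text = LINKS.sub('', text)             # drop links
    text = TAG_SYMBOLS.sub('', text)       # drop '#' and '@', keep the word
    text = DIGITS.sub('', text)            # drop digits
    text = text.lower()                    # lowercase
    text = PUNCTUATION.sub(' ', text)      # punctuation becomes a space
    text = WHITESPACE_RUNS.sub(' ', text)  # collapse runs of whitespace
    return text


def keep_language(records, detect_fn, lang='en'):
    """Keep the (id, text) pairs whose detected language equals `lang`.

    `detect_fn` is any callable mapping a string to a language code. Texts
    for which it raises are dropped, as in `english_keeper`.
    """
    kept = []
    for tweet_id, text in records:
        try:
            if detect_fn(text) == lang:
                kept.append((tweet_id, text))
        except Exception:  # broad on purpose: the detector is pluggable
            continue
    return kept
```

With `langdetect` installed, `keep_language(records, detect)` would mirror `english_keeper`. The `langdetect` README notes that detection is not deterministic unless `DetectorFactory.seed` is set before the first call, which matters if the filter has to be reproducible.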

In [9]:
splits = np.split(df, 27)
clean_splits = []
labels = ['id', 'text']

for i in tnrange(len(splits)):
    # Row range covered by this chunk (the index is a RangeIndex)
    start, stop = splits[i].index.start, splits[i].index.stop
    try:
        # Reuse the chunk if a previous run already processed it
        arr = np.load(f'Processing/split_{i}.npy', allow_pickle=True)
        clean_splits.append(pd.DataFrame(arr, columns=labels))
        print(f'Loaded from index {start} to {stop}')
    except FileNotFoundError:
        print(f'Cleaning from index {start} to {stop}')
        df_clean = cleaner(splits[i])
        df_english = english_keeper(df_clean)
        clean_splits.append(df_english)
        np.save(f'Processing/split_{i}.npy', df_english.values)

Loaded from index 0 to 33862
Cleaning from index 33862 to 67724
Cleaning from index 67724 to 101586
Cleaning from index 101586 to 135448
Cleaning from index 135448 to 169310
Cleaning from index 169310 to 203172
Cleaning from index 203172 to 237034
Cleaning from index 237034 to 270896
Cleaning from index 270896 to 304758
Cleaning from index 304758 to 338620
Cleaning from index 338620 to 372482
Cleaning from index 372482 to 406344
Cleaning from index 406344 to 440206
Cleaning from index 440206 to 474068
Cleaning from index 474068 to 507930
Cleaning from index 507930 to 541792
Cleaning from index 541792 to 575654
Cleaning from index 575654 to 609516
Cleaning from index 609516 to 643378
Cleaning from index 643378 to 677240
Cleaning from index 677240 to 711102
Cleaning from index 711102 to 744964
Cleaning from index 744964 to 778826
Cleaning from index 778826 to 812688
Cleaning from index 812688 to 846550
Cleaning from index 846550 to 880412
Cleaning from index 880412 to 914274

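This cleaning removes about **63%** of the dataset, leaving **339 367** tweets. Since `cleaner` never drops rows, the whole loss comes from the language filter: tweets not detected as English, plus texts for which `langdetect` raised an exception. This improves the quality of the remaining tweets and reduces computation time.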
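
The share of tweets lost can be recomputed from the two counts reported in this notebook. The variable names below are illustrative.

```python
n_collected = 914274  # tweets collected (output of cell 3)
n_kept = 339367       # rows of df_clean after cleaning and filtering

kept_share = n_kept / n_collected
lost_share = 1 - kept_share

print(f'kept: {kept_share:.1%}, lost: {lost_share:.1%}')
```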

In [25]:
df_clean = pd.concat(clean_splits)
print(df_clean.shape)

(339367, 2)
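
A note on the split into 27 chunks used in cell 9: `np.split` only accepts a number of sections that divides the data evenly, which holds here because 914274 = 27 × 33862. For a dataset of another size, the chunk boundaries could be computed by hand. This is a minimal sketch, and `chunk_bounds` is a hypothetical helper rather than part of the project.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return (start, stop) row ranges covering range(n_rows) in n_chunks parts.

    The first n_rows % n_chunks chunks get one extra row, so the function
    also works when n_rows is not a multiple of n_chunks.
    """
    base, extra = divmod(n_rows, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


bounds = chunk_bounds(914274, 27)
print(bounds[0], bounds[-1])
```

Each pair can then be used as `df.iloc[start:stop]`. NumPy's `np.array_split` follows the same convention for uneven splits.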
