# Master Project

This notebook contains all the code used for my Master's project.

# Data collection

I used the `twitterscraper` package with the following query.

```
twitterscraper volkswagen -bd 2015-09-01 -ed 2015-10-15 -o 01_09_2015-15_10_2015.json
```

In [64]:
import re
import json

import numpy as np
import pandas as pd
from tqdm import tnrange

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import WordPunctTokenizer

# Data cleaning

Let's open the raw data.

In [2]:
%%time

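try:
    # Fast path: reload the cached array saved by a previous run
    labels = ['fullname', 'html', 'id', 'likes', 'replies', 'retweets', 'text', 'timestamp', 'url', 'user']
    # The cached array has object dtype, which recent NumPy versions only load with allow_pickle=True
    arr = np.load('data.npy', allow_pickle=True)
    df = pd.DataFrame(arr, columns=labels)
except FileNotFoundError:
    # Slow path: parse the scraped JSON, then cache it for the next run
    with open('01_09_2015-15_10_2015.json', 'r') as file:
        raw = json.load(file)
    df = pd.DataFrame(raw)
    np.save('data.npy', df.values)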

CPU times: user 8.21 s, sys: 13 s, total: 21.2 s
Wall time: 29.1 s
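
The cell above follows a load-or-build caching pattern: reuse the cached file if it exists, otherwise parse the raw data and write the cache. Below is a minimal, generic sketch of that pattern. It swaps `np.save` for the standard-library `pickle` module so that it runs without any third-party package; `load_or_build` and `build` are hypothetical names introduced for illustration, not part of the notebook.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached value if the file exists, else build and cache it.

    `build` is a zero-argument callable producing the value (hypothetical helper).
    """
    path = Path(cache_path)
    if path.exists():
        # Fast path: reuse the result of a previous run
        with path.open('rb') as fh:
            return pickle.load(fh)
    # Slow path: compute the value once, then store it for next time
    value = build()
    with path.open('wb') as fh:
        pickle.dump(value, fh)
    return value
```

Calling `load_or_build('some_cache.pkl', expensive_function)` twice runs `expensive_function` only on the first call, as long as the cache file is not deleted in between.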


We have collected **914 274** tweets.

In [3]:
print(f'Number of tweets collected: {len(df)}')

Number of tweets collected: 914274


A tweet looks like this.

In [10]:
df.iloc[1]

fullname                                       Rhondas Romance
html         <p class="TweetTextSize js-tweet-text tweet-te...
id                                          640675779253280768
likes                                                        0
replies                                                      0
retweets                                                     0
text         I liked a @YouTube video http://youtu.be/pmzZb...
timestamp                                  2015-09-06T23:59:43
url                 /RhosBookReviews/status/640675779253280768
user                                           RhosBookReviews
Name: 1, dtype: object

To conduct our analysis, we will only keep the `id` and the `text` columns. Later on, we might rely on likes, replies and retweets to give importance to tweets.

In [11]:
df.drop(columns=['fullname', 'html', 'likes', 'replies', 'retweets', 'timestamp', 'url', 'user'], inplace=True)
df.head()

| | id | text |
| --- | --- | --- |
| 0 | 640675797720801280 | Out here with my twin! @big_euro #socaleuro #V... |
| 1 | 640675779253280768 | I liked a @YouTube video http://youtu.be/pmzZb... |
| 2 | 640675589251330048 | Triste que ahora estacionen un Volkswagen que ... |
| 3 | 640675540794515456 | I liked a @YouTube video http://youtu.be/D-He3... |
| 4 | 640675381922656256 | I liked a @YouTube video http://youtu.be/8ZhGp... |


## Cleaning and removing non-English tweets

We remove the links.

In [12]:
print('Original:')
print(df['text'][1])
print(df['text'][18])
print('Transformed:')
print(re.sub(r'(https?://|pic\.twitter)\S+', '', df['text'][1]))
print(re.sub(r'(https?://|pic\.twitter)\S+', '', df['text'][18]))

Original:
I liked a @YouTube video http://youtu.be/pmzZbUioFAQ?a  2015 Volkswagen Sales Event | “Model Rear End” Passat Commercial
Volkswagen pic.twitter.com/hPnvGJ3Ysi
Transformed:
I liked a @YouTube video   2015 Volkswagen Sales Event | “Model Rear End” Passat Commercial
Volkswagen 
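
The link removal shown above can be reproduced on plain strings with the standard `re` module. This is a small sketch rather than the notebook's own code: `LINK_PATTERN` and `strip_links` are names chosen for illustration, and the dot in `pic\.twitter` is escaped so that it only matches a literal dot.

```python
import re

# Matches http(s) URLs and pic.twitter.com image references up to the next whitespace
LINK_PATTERN = re.compile(r'(?:https?://|pic\.twitter)\S+')


def strip_links(text):
    """Remove links from a tweet, leaving the surrounding whitespace untouched."""
    return LINK_PATTERN.sub('', text)


print(repr(strip_links('Volkswagen pic.twitter.com/hPnvGJ3Ysi')))
```

The surrounding spaces are left in place on purpose; they are collapsed later, once all other substitutions have been applied.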


We remove the `@` and `#` symbols from user tags and hashtags, but keep the text attached to them.

In [13]:
print('Original:')
print(df['text'][0])
print('Transformed:')
print(re.sub(r'(#|@)', '', df['text'][0]))

Original:
Out here with my twin! @big_euro #socaleuro #Volkswagen #b7passat #passat #static #bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv
Transformed:
Out here with my twin! big_euro socaleuro Volkswagen b7passat passat static bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv


Let's write a function that combines all the cleaning steps.

We also write a function that keeps only the English tweets.

In [14]:
def cleaner(df):
    # Create a copy
    df_tmp = df.copy()
    
    # Links cleaning (regex=True is required on recent pandas versions)
    df_tmp['text'] = df_tmp['text'].str.replace(r'(https?://|pic\.twitter)\S+', '', regex=True)
    
    # @ and # cleaning
    df_tmp['text'] = df_tmp['text'].str.replace(r'[#@]', '', regex=True)
    
    # Remove the numbers
    df_tmp['text'] = df_tmp['text'].str.replace(r'\d+', '', regex=True)
    
    # Put everything to lower case
    df_tmp['text'] = df_tmp['text'].str.lower()
    
    # Replace punctuation with spaces
    df_tmp['text'] = df_tmp['text'].str.replace(r'[^\w\s]', ' ', regex=True)
    
    # Collapse runs of whitespace into a single space
    df_tmp['text'] = df_tmp['text'].str.replace(r'\s{2,}', ' ', regex=True)
    
    return df_tmp

def english_keeper(df):    
    english_tweets = []
    
    for i in tnrange(len(df)):
        text = df['text'].iloc[i]
        try:
            if detect(text) == 'en':
                english_tweets.append((df['id'].iloc[i], text))
        except LangDetectException:
            pass
    
    df_english = pd.DataFrame.from_records(english_tweets, columns=['id', 'text'])
    
    return df_english
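
To make the order of the cleaning steps easier to follow, here is a sketch of the same sequence applied to a single string with the standard `re` module instead of pandas. `clean_text` is a hypothetical helper written for illustration; it is meant to mirror the steps of `cleaner`, not to replace it.

```python
import re


def clean_text(text):
    """Apply the cleaning steps of `cleaner` to one tweet (illustrative sketch)."""
    text = re.sub(r'(?:https?://|pic\.twitter)\S+', '', text)  # links
    text = re.sub(r'[#@]', '', text)                           # tag symbols only
    text = re.sub(r'\d+', '', text)                            # numbers
    text = text.lower()                                        # lower case
    text = re.sub(r'[^\w\s]', ' ', text)                       # punctuation -> space
    text = re.sub(r'\s{2,}', ' ', text)                        # collapse whitespace
    return text


tweet = ('Out here with my twin! @big_euro #socaleuro #Volkswagen #b7passat '
         '#passat #static #bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv')
print(clean_text(tweet))
```

Links are removed first because the later steps would otherwise break them into fragments: stripping digits and punctuation from a URL leaves stray tokens that are no longer recognisable as a link. Underscores survive because `\w` includes `_`, which is why `big_euro` stays in one piece.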

In [15]:
splits = np.split(df, 27)
clean_splits = []
labels = ['id', 'text']

for i in tnrange(len(splits)):
    # Public RangeIndex attributes; stop is exclusive
    start, stop = splits[i].index.start, splits[i].index.stop
    try:
        # Reuse the split if it was already processed in a previous run
        arr = np.load(f'Processing/split_{i}.npy', allow_pickle=True)
        clean_splits.append(pd.DataFrame(arr, columns=labels))
        print(f'Loaded from index {start} to {stop}')
    except FileNotFoundError:
        print(f'Cleaning from index {start} to {stop}')
        df_clean = cleaner(splits[i])
        df_english = english_keeper(df_clean)
        clean_splits.append(df_english)
        np.save(f'Processing/split_{i}.npy', df_english.values)


Loaded from index 0 to 33862
Loaded from index 33862 to 67724
Loaded from index 67724 to 101586
Loaded from index 101586 to 135448
Loaded from index 135448 to 169310
Loaded from index 169310 to 203172
Loaded from index 203172 to 237034
Loaded from index 237034 to 270896
Loaded from index 270896 to 304758
Loaded from index 304758 to 338620
Loaded from index 338620 to 372482
Loaded from index 372482 to 406344
Loaded from index 406344 to 440206
Loaded from index 440206 to 474068
Loaded from index 474068 to 507930
Loaded from index 507930 to 541792
Loaded from index 541792 to 575654
Loaded from index 575654 to 609516
Loaded from index 609516 to 643378
Loaded from index 643378 to 677240
Loaded from index 677240 to 711102
Loaded from index 711102 to 744964
Loaded from index 744964 to 778826
Loaded from index 778826 to 812688
Loaded from index 812688 to 846550
Loaded from index 846550 to 880412
Loaded from index 880412 to 914274
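
The index ranges printed above follow directly from the row count: 914 274 rows divide evenly into 27 chunks of 33 862 rows, which matters because `np.split` with an integer argument expects an equal division. The sketch below recomputes the boundaries in plain Python; `chunk_bounds` is a hypothetical helper used only for illustration.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return (start, stop) row bounds for n_chunks equal-sized chunks; stop is exclusive."""
    if n_rows % n_chunks != 0:
        raise ValueError(f'{n_rows} rows cannot be split into {n_chunks} equal chunks')
    size = n_rows // n_chunks
    return [(i * size, (i + 1) * size) for i in range(n_chunks)]


bounds = chunk_bounds(914_274, 27)
print(bounds[0], bounds[-1])  # (0, 33862) (880412, 914274)
```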



With this cleaning, we lose about **63%** of the dataset, leaving **339 367** tweets. It does ensure better-quality tweets and will reduce computation time.

In [16]:
df_clean = pd.concat(clean_splits)
print(df_clean.shape)

(339367, 2)
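
The share of tweets removed can be checked from the two counts reported in this notebook. A short worked calculation, using only those two numbers:

```python
total_tweets = 914_274    # tweets collected
english_tweets = 339_367  # tweets kept after cleaning and language filtering

removed = total_tweets - english_tweets
removed_share = removed / total_tweets

print(f'Removed {removed} tweets ({removed_share:.1%})')  # Removed 574907 tweets (62.9%)
```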


In [21]:
df_clean['text'].head()

0    out here with my twin big_euro socaleuro volks...
1    i liked a youtube video volkswagen sales event...
2    hoje eu to feliiiiiz hoje eu to contente o o f...
3    i liked a youtube video volkswagen sales event...
4    what an awesome way to display our dealer sign...
Name: text, dtype: object

## Remove stopwords

In [42]:
english_stopwords = stopwords.words('english')

print('Original:')
print(df_clean.iloc[4, 1])
print('Transformed:')
print(' '.join([w for w in df_clean.iloc[4, 1].split(' ') if w not in english_stopwords]).strip())

Original:
what an awesome way to display our dealer sign just awesome aircooled_ch new volkswagen sign by 
Transformed:
awesome way display dealer sign awesome aircooled_ch new volkswagen sign


In [46]:
def remove_stopwords(df, add=None):
    # A set makes the membership test much faster than a list
    english_stopwords = set(stopwords.words('english'))
    
    # Optionally extend the list with corpus-specific words
    if isinstance(add, list):
        english_stopwords.update(add)
            
    words_tweets = []
    
    for i in tnrange(len(df)):
        text = ' '.join([w for w in df['text'].iloc[i].split(' ') if w not in english_stopwords]).strip()
        words_tweets.append((df['id'].iloc[i], text))

    df_words = pd.DataFrame.from_records(words_tweets, columns=['id', 'text'])
    
    return df_words
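
The filtering logic can be tried on a single string without downloading the NLTK corpus. The sketch below is illustrative only: `DEMO_STOPWORDS` is a tiny hand-picked stand-in for `stopwords.words('english')`, and `drop_stopwords` is a hypothetical helper, not a function from the notebook.

```python
# Hand-picked stand-in for the NLTK English stopword list (assumption for the demo)
DEMO_STOPWORDS = {'what', 'an', 'to', 'our', 'just', 'by'}


def drop_stopwords(text, stopword_set, extra=()):
    """Remove stopwords (plus optional extra words) from a whitespace-separated string."""
    blocked = set(stopword_set) | set(extra)
    return ' '.join(w for w in text.split() if w not in blocked)


tweet = ('what an awesome way to display our dealer sign just awesome '
         'aircooled_ch new volkswagen sign by ')
print(drop_stopwords(tweet, DEMO_STOPWORDS))
print(drop_stopwords(tweet, DEMO_STOPWORDS, extra=['volkswagen']))
```

The `extra` argument plays the role of `add` in `remove_stopwords`: since every tweet was collected with the query `volkswagen`, that word carries little information and is a natural candidate for removal.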

In [48]:
df_words = remove_stopwords(df_clean, add=['volkswagen'])


## Stemming

In [65]:
sno = SnowballStemmer('english')

print('Original:')
print(df_words.iloc[13, 1])
print('Transformed:')
print(' '.join([sno.stem(w) for w in df_words.iloc[13, 1].split(' ')]).strip())

Original:
bringing new tiguan frankfurt filed plants manufacturing frankfurt motor show vol car
Transformed:
bring new tiguan frankfurt file plant manufactur frankfurt motor show vol car


In [66]:
def stemming(df):
    sno = SnowballStemmer('english')
    
    stemmed_tweets = []
    
    for i in tnrange(len(df)):
        text = ' '.join([sno.stem(w) for w in df['text'].iloc[i].split(' ')]).strip()
        stemmed_tweets.append((df['id'].iloc[i], text))

    df_stem = pd.DataFrame.from_records(stemmed_tweets, columns=['id', 'text'])
    
    return df_stem

In [67]:
df_stem = stemming(df_words)


In [71]:
df_stem['text'].head(16)

0     twin big_euro socaleuro bpassat passat static ...
1     like youtub video sale event rear passat commerci
2     hoje eu feliiiiiz hoje eu content fusca vw vwa...
3      like youtub video sale event deal jetta commerci
4     awesom way display dealer sign awesom aircoole...
5     like youtub video sale event like hot passat c...
6        golf want car formal attitud advanc featur ncn
7     bet long term russian growth new engin plant m...
8     say believ superbeetl one greatest car ever built
9     could car want kind car kia picanto aw mini co...
10    kombi vw vwar vwair aircool splitbuss splitwin...
11    gerherbert optical_laura work neil drive demis...
12                                       theyoung jetta
13    bring new tiguan frankfurt file plant manufact...
14    carsal vwbeetl beetl classic convert vw beetl ...
15    chaotic line gtibythewat water vw gti ride cal...
Name: text, dtype: object

In [72]:
df_stem.iloc[339361, 1]

'winterkorn person deepli sorri broken trust custom jdpower carnew'

In [73]:
df_stem.iloc[1, 1]

'like youtub video sale event rear passat commerci'