<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Load-Packages" data-toc-modified-id="Load-Packages-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load Packages</a></span></li><li><span><a href="#Checking-Collected-Data-(user-timeline-Tweets-data)" data-toc-modified-id="Checking-Collected-Data-(user-timeline-Tweets-data)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Checking Collected Data (user timeline Tweets data)</a></span></li><li><span><a href="#Loading-and-Processing-Selected-Data" data-toc-modified-id="Loading-and-Processing-Selected-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Loading and Processing Selected Data</a></span><ul class="toc-item"><li><span><a href="#Which-columns-are-present-in-both-older-and-most-recent-data?" data-toc-modified-id="Which-columns-are-present-in-both-older-and-most-recent-data?-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Which columns are present in both older and most recent data?</a></span></li><li><span><a href="#Which-ones-are-missing-in-the-older-data?" 
data-toc-modified-id="Which-ones-are-missing-in-the-older-data?-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Which ones are missing in the older data?</a></span></li></ul></li><li><span><a href="#Insert-language-Information-to-Older-Data-(retrieved-on-16th-June-2020)" data-toc-modified-id="Insert-language-Information-to-Older-Data-(retrieved-on-16th-June-2020)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Insert language Information to Older Data (retrieved on 16th June 2020)</a></span><ul class="toc-item"><li><span><a href="#Evaluating-Some-Language-Detectors" data-toc-modified-id="Evaluating-Some-Language-Detectors-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Evaluating Some Language Detectors</a></span><ul class="toc-item"><li><span><a href="#Detecting-language-using-detect_language-from-TextBlob" data-toc-modified-id="Detecting-language-using-detect_language-from-TextBlob-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Detecting language using detect_language from TextBlob</a></span></li><li><span><a href="#Detecting-language-using-detect_language-from-TextBlob" data-toc-modified-id="Detecting-language-using-detect_language-from-TextBlob-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Detecting language using detect_language from <code>TextBlob</code></a></span></li><li><span><a href="#Detecting-language-using-LangID" data-toc-modified-id="Detecting-language-using-LangID-5.1.3"><span class="toc-item-num">5.1.3&nbsp;&nbsp;</span>Detecting language using <code>LangID</code></a></span></li><li><span><a href="#Detecting-language-using-TextCat" data-toc-modified-id="Detecting-language-using-TextCat-5.1.4"><span class="toc-item-num">5.1.4&nbsp;&nbsp;</span>Detecting language using <code>TextCat</code></a></span></li></ul></li></ul></li><li><span><a href="#Concatenating-Older-and-Most-Recent-Data" data-toc-modified-id="Concatenating-Older-and-Most-Recent-Data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Concatenating 
Older and Most Recent Data</a></span></li><li><span><a href="#Processing-Twitter-Search-Data" data-toc-modified-id="Processing-Twitter-Search-Data-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Processing Twitter Search Data</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Introduction

Because I've been collecting Tweets since I started this project, I thought I could extend the time interval covered by my latest data a bit further.

The last time I modified and tested [01-collecting_and_saving_tweets.ipynb](http://localhost:8889/notebooks/twitter_analysis_online_grocery_NL/notebooks/01-collecting_and_saving_tweets.ipynb) was June 24th, 2020. At that point the collected data covered the following intervals:

**AH:** April 8th till June 24th, 2020

**Jumbo:** March 10, 2020 till June 24th, 2020

**Picnic:** December 7th, 2018 till June 24th, 2020

For a fair comparison we need to use the same time frame for all 3 supermarkets. This would currently limit us to the range `April 8th till June 24th, 2020`, which is the time frame covered by AH.
As we will see, by incorporating the first User Timeline Tweets file I'm able to add 9 days, i.e., start from March 30th.
The main goal of this notebook is to get the Tweets we collected, both from user timelines and from search queries, ready to be analyzed.
To achieve this goal, we perform the following steps in this notebook.

1. **[Check collected data:](#Checking-Collected-Data-(user-timeline-Tweets-data))** Check which of the older files is most convenient to combine with the most recent User Timeline Tweets file from @albertheijn.
2. **[Load and process selected data:](#Loading-and-Processing-Selected-Data)** Here we load the older data selected in step 1 and the most recent data. Some pre-processing is done here, but the most important addition to the data is made in the next step.
3. **[Insert `language`:](#Insert-language-Information-to-Older-Data-(retrieved-on-16th-June-2020))** We need to add the `language` of the Tweets to the older data. `language` was not included in the first Tweets data, and it will be important for sentiment analysis, since Dutch will most probably be the language used.
4. **[Evaluate some language detectors:](#Evaluating-Some-Language-Detectors)** Because we need to use a language detector, I'll use this opportunity to evaluate 4 of them.
5. **[Concatenate data:](#Concatenating-Older-and-Most-Recent-Data)** Concatenate the AH Timeline Tweets and save them as .csv.
6. **[Process Search Tweets data:](#Processing-Twitter-Search-Data)** Concatenate the results obtained with GetSearch queries so we can analyze them.
7. **[Conclusions:](#Conclusions)** We close this notebook with a summary and some comments.
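
The common time frame mentioned in the introduction is simply the latest start date and the earliest end date over the three accounts. A minimal sketch of that arithmetic, using only the dates stated above (the helper `common_time_frame` is a hypothetical name, not part of the project code):

```python
from datetime import date

# Coverage per supermarket as stated in the introduction: (start, end).
coverage = {
    "AH": (date(2020, 4, 8), date(2020, 6, 24)),
    "Jumbo": (date(2020, 3, 10), date(2020, 6, 24)),
    "Picnic": (date(2018, 12, 7), date(2020, 6, 24)),
}


def common_time_frame(intervals):
    """Return the (start, end) window that every interval covers."""
    start = max(s for s, _ in intervals.values())
    end = min(e for _, e in intervals.values())
    if start > end:
        raise ValueError("the intervals do not overlap")
    return start, end


before = common_time_frame(coverage)

# Same computation once AH is extended back to March 30th.
extended = dict(coverage, AH=(date(2020, 3, 30), date(2020, 6, 24)))
after = common_time_frame(extended)

gained_days = (before[0] - after[0]).days
print(before, after, gained_days)
```

With these dates AH is the limiting account in both cases, which is why extending only the AH data moves the start of the comparison window.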



# Load Packages

In [2]:
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import time

TodaysDate = time.strftime("%Y-%m-%d")

# Checking Collected Data (user timeline Tweets data)

In [3]:
def search_file_in_folder(folder,str_file, type_file='csv'):
    """ Given a folder and a part of a file's name, outputs a list of file paths"""
    
    # glob already returns a list with the matching paths, so no loop is needed
    list_files_paths = glob.glob(folder+'*'+str_file+'*.'+type_file)
    
    if len(list_files_paths):
        return list_files_paths
    else:
        return 'No files containing {}'.format(str_file)

In [4]:
def create_dataframe_info(result):
    """ Create a dataframe describing each tweet .csv file: path of the file, min and max created_at date, 
    number of tweets, and number of columns """
    
    filepath_list = []
    min_created_list = []
    max_created_list = []
    n_tweet_list = []
    n_columns = []
    
    for file in result:
        df = pd.read_csv(file)
        df['created_at'] = pd.to_datetime(df['created_at'], infer_datetime_format=True)
        filepath_list.append(file)
        min_created_list.append(min(df['created_at']))
        max_created_list.append(max(df['created_at']))
        n_tweet_list.append(df.shape[0])
        n_columns.append(df.shape[1])
        
    dict_df = {'file_path':filepath_list,
              'min_created_list':min_created_list,
              'max_created_list':max_created_list,
              'n_tweet_list':n_tweet_list,
              'n_columns':n_columns}
    
    df_new = pd.DataFrame(dict_df)
                
    return df_new

In [5]:
folder = "../data/tweets/"
result = search_file_in_folder(folder, 'albertheijn')
result

['../data/tweets\\albertheijn_2020-06-16-21-58.csv',
 '../data/tweets\\albertheijn_2020-06-16-22-00.csv',
 '../data/tweets\\albertheijn_2020-06-16-22-18.csv',
 '../data/tweets\\albertheijn_2020-06-21-00-49.csv',
 '../data/tweets\\albertheijn_2020-06-21-00-53.csv',
 '../data/tweets\\albertheijn_2020-06-21-14-42.csv',
 '../data/tweets\\albertheijn_2020-06-21-14-43.csv',
 '../data/tweets\\albertheijn_2020-06-21-14-44.csv',
 '../data/tweets\\albertheijn_2020-06-21-14-45.csv',
 '../data/tweets\\albertheijn_2020-06-22-15-02.csv',
 '../data/tweets\\albertheijn_2020-06-22-15-03.csv',
 '../data/tweets\\albertheijn_2020-06-24-17-10.csv']

In [6]:
df_info_AH = create_dataframe_info(result).sort_values(by=['min_created_list','n_tweet_list'])
df_info_AH

Unnamed: 0,file_path,min_created_list,max_created_list,n_tweet_list,n_columns
0,../data/tweets\albertheijn_2020-06-16-21-58.csv,2020-03-30 08:04:02+00:00,2020-06-16 16:36:50+00:00,3243,14
1,../data/tweets\albertheijn_2020-06-16-22-00.csv,2020-03-30 08:04:02+00:00,2020-06-16 16:36:50+00:00,3243,14
2,../data/tweets\albertheijn_2020-06-16-22-18.csv,2020-03-30 08:04:02+00:00,2020-06-16 16:36:50+00:00,3243,14
3,../data/tweets\albertheijn_2020-06-21-00-49.csv,2020-04-03 15:35:47+00:00,2020-06-20 18:18:50+00:00,3204,19
4,../data/tweets\albertheijn_2020-06-21-00-53.csv,2020-04-03 15:35:47+00:00,2020-06-20 18:18:50+00:00,3204,19
6,../data/tweets\albertheijn_2020-06-21-14-43.csv,2020-04-03 15:35:47+00:00,2020-06-21 12:27:32+00:00,3218,19
7,../data/tweets\albertheijn_2020-06-21-14-44.csv,2020-04-03 15:35:47+00:00,2020-06-21 12:27:32+00:00,3218,19
8,../data/tweets\albertheijn_2020-06-21-14-45.csv,2020-04-03 15:35:47+00:00,2020-06-21 12:45:16+00:00,3219,19
9,../data/tweets\albertheijn_2020-06-22-15-02.csv,2020-04-06 16:33:02+00:00,2020-06-22 11:43:16+00:00,3217,19
10,../data/tweets\albertheijn_2020-06-22-15-03.csv,2020-04-06 16:33:02+00:00,2020-06-22 11:43:16+00:00,3217,19


In [7]:
df_info_AH.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 5
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   file_path         12 non-null     object             
 1   min_created_list  12 non-null     datetime64[ns, UTC]
 2   max_created_list  12 non-null     datetime64[ns, UTC]
 3   n_tweet_list      12 non-null     int64              
 4   n_columns         12 non-null     int64              
dtypes: datetime64[ns, UTC](2), int64(2), object(1)
memory usage: 576.0+ bytes


We have 12 .csv files containing AH's user timeline Tweets data. There are 3 files from June 16th, 2020, all having 3243 tweets and 14 columns (features). The newest file was collected on June 24th, 2020. It contains 3227 tweets and 21 features.

By concatenating `albertheijn_2020-06-16-21-58.csv` and `albertheijn_2020-06-24-17-10.csv` we will be covering the time interval that goes from `30th March 2020 until 24th June 2020`.
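
The number of files per retrieval day and the newest file can also be read directly from the file names. The sketch below assumes that the suffix of each file name is the retrieval time in `YYYY-MM-DD-HH-MM` form; `retrieval_time` is a hypothetical helper, and the paths are copied from the listing above.

```python
import re
from collections import Counter
from datetime import datetime

# Paths as printed by search_file_in_folder above (Windows separators kept).
paths = [
    "../data/tweets\\albertheijn_2020-06-16-21-58.csv",
    "../data/tweets\\albertheijn_2020-06-16-22-00.csv",
    "../data/tweets\\albertheijn_2020-06-16-22-18.csv",
    "../data/tweets\\albertheijn_2020-06-21-00-49.csv",
    "../data/tweets\\albertheijn_2020-06-21-00-53.csv",
    "../data/tweets\\albertheijn_2020-06-21-14-42.csv",
    "../data/tweets\\albertheijn_2020-06-21-14-43.csv",
    "../data/tweets\\albertheijn_2020-06-21-14-44.csv",
    "../data/tweets\\albertheijn_2020-06-21-14-45.csv",
    "../data/tweets\\albertheijn_2020-06-22-15-02.csv",
    "../data/tweets\\albertheijn_2020-06-22-15-03.csv",
    "../data/tweets\\albertheijn_2020-06-24-17-10.csv",
]

# Assumption: the file name ends with _YYYY-MM-DD-HH-MM.csv
STAMP = re.compile(r"_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2})\.csv$")


def retrieval_time(path):
    """Parse the timestamp encoded at the end of the file name."""
    match = STAMP.search(path)
    if match is None:
        raise ValueError(f"no timestamp found in {path!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M")


# How many files were saved on each day, and which file is the newest.
files_per_day = Counter(retrieval_time(p).date().isoformat() for p in paths)
newest = max(paths, key=retrieval_time)

print(files_per_day)
print(newest)
```

A regular expression is used instead of `os.path.basename` so that the mixed `/` and `\` separators in the listing do not matter.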

# Loading and Processing Selected Data

In [8]:
# loading older data (back to 30th March)
df_AH_2020_06_16 = pd.read_csv(df_info_AH.loc[0,'file_path'])
# before checking for difference in the columns between old and new data I'll rename handle to screen_name since both are the same
df_AH_2020_06_16.rename(columns={'handle':'screen_name'},inplace=True)
df_AH_2020_06_16.head()

Unnamed: 0,mined_at,screen_name,tweet_id,tweet_id_str,created_at,year,month,day,day_of_week,hour,minute,retweet_count,source,text
0,2020-06-16 21:58:04.280117,albertheijn,1244535843135672326,1244535843135672326,2020-03-30 08:04:02+00:00,2020,3,30,0,8,4,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...","@derots Voorraad is er genoeg, het is voor ons..."
1,2020-06-16 21:58:04.280117,albertheijn,1244538454890987523,1244538454890987523,2020-03-30 08:14:24+00:00,2020,3,30,0,8,14,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@MoniquevDBurgh We doen er alles aan om zoveel...
2,2020-06-16 21:58:04.280117,albertheijn,1244540668225126401,1244540668225126401,2020-03-30 08:23:12+00:00,2020,3,30,0,8,23,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@AnnekeVisser15 Klopt! De Persoonlijke Bonus w...
3,2020-06-16 21:58:04.012317,albertheijn,1244541424588251141,1244541424588251141,2020-03-30 08:26:12+00:00,2020,3,30,0,8,26,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@waltervantiel We kopen groenten en fruit z...
4,2020-06-16 21:58:04.012317,albertheijn,1244542564344238083,1244542564344238083,2020-03-30 08:30:44+00:00,2020,3,30,0,8,30,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@bbstring Je kunt ons het beste een privéberic...


In [9]:
df_AH_2020_06_16.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3243 entries, 0 to 3242
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   mined_at       3243 non-null   object
 1   screen_name    3243 non-null   object
 2   tweet_id       3243 non-null   int64 
 3   tweet_id_str   3243 non-null   int64 
 4   created_at     3243 non-null   object
 5   year           3243 non-null   int64 
 6   month          3243 non-null   int64 
 7   day            3243 non-null   int64 
 8   day_of_week    3243 non-null   int64 
 9   hour           3243 non-null   int64 
 10  minute         3243 non-null   int64 
 11  retweet_count  3243 non-null   int64 
 12  source         3243 non-null   object
 13  text           3243 non-null   object
dtypes: int64(9), object(5)
memory usage: 354.8+ KB


In [10]:
# changing data type of 'created_at' to datetime

df_AH_2020_06_16['created_at'] = pd.to_datetime(df_AH_2020_06_16['created_at'], infer_datetime_format=True)

In [11]:
min(df_AH_2020_06_16['created_at']),max(df_AH_2020_06_16['created_at'])

(Timestamp('2020-03-30 08:04:02+0000', tz='UTC'),
 Timestamp('2020-06-16 16:36:50+0000', tz='UTC'))

Now we load the most recent data (until June 24th).

In [12]:
df_AH_2020_06_24 = pd.read_csv(df_info_AH.loc[11,'file_path'])
df_AH_2020_06_24.head()

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,screen_name,tweet_id,...,retweet_count,favorite_count,source,hashtags,urls,language,user_favourites_count,followers_count,friends_count,text
0,2020-06-24 17:10:43.175060,2020-04-08 08:47:29+00:00,2020,4,8,2,8,47,albertheijn,1247808270837800961,...,0,,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",[],[],nl,580,45540,6,@SpectrumRebel Toch komen er momentjes bij. Ho...
1,2020-06-24 17:10:43.175060,2020-04-08 08:50:43+00:00,2020,4,8,2,8,50,albertheijn,1247809085962997762,...,0,,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",[],[],nl,580,45540,6,@MarianMons Hi Marian. Oeh helaas kan ik dat n...
2,2020-06-24 17:10:43.175060,2020-04-08 08:53:00+00:00,2020,4,8,2,8,53,albertheijn,1247809660150853632,...,0,,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",[],[],nl,580,45540,6,@MarianMons Met een bezorgbundel kan je verder...
3,2020-06-24 17:10:43.175060,2020-04-08 08:59:40+00:00,2020,4,8,2,8,59,albertheijn,1247811338547728384,...,0,,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",[],"[{'expanded_url': 'http://ah.nl/bezorgbundel',...",nl,580,45540,6,@molislaegers Tip! Je kunt als bezorgbundelkla...
4,2020-06-24 17:10:43.175060,2020-04-08 09:01:14+00:00,2020,4,8,2,9,1,albertheijn,1247811730329284609,...,0,,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",[],[],nl,580,45540,6,@public_insulter Niet iedereen dacht er helaas...


In [13]:
df_AH_2020_06_24.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3227 entries, 0 to 3226
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   mined_at               3227 non-null   object 
 1   created_at             3227 non-null   object 
 2   year                   3227 non-null   int64  
 3   month                  3227 non-null   int64  
 4   day                    3227 non-null   int64  
 5   day_of_week            3227 non-null   int64  
 6   hour                   3227 non-null   int64  
 7   minute                 3227 non-null   int64  
 8   screen_name            3227 non-null   object 
 9   tweet_id               3227 non-null   int64  
 10  tweet_id_str           3227 non-null   int64  
 11  retweet_count          3227 non-null   int64  
 12  favorite_count         26 non-null     float64
 13  source                 3227 non-null   object 
 14  hashtags               3227 non-null   object 
 15  urls

In [14]:
df_AH_2020_06_24['created_at'] = pd.to_datetime(df_AH_2020_06_24['created_at'], infer_datetime_format=True)

In [15]:
df_AH_2020_06_24.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3227 entries, 0 to 3226
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   mined_at               3227 non-null   object             
 1   created_at             3227 non-null   datetime64[ns, UTC]
 2   year                   3227 non-null   int64              
 3   month                  3227 non-null   int64              
 4   day                    3227 non-null   int64              
 5   day_of_week            3227 non-null   int64              
 6   hour                   3227 non-null   int64              
 7   minute                 3227 non-null   int64              
 8   screen_name            3227 non-null   object             
 9   tweet_id               3227 non-null   int64              
 10  tweet_id_str           3227 non-null   int64              
 11  retweet_count          3227 non-null   int64            

## Which columns are present in both older and most recent data?

In [16]:
common_columns = list(set(df_AH_2020_06_16.columns).intersection(set(df_AH_2020_06_24.columns)))
common_columns.sort()
common_columns

['created_at',
 'day',
 'day_of_week',
 'hour',
 'mined_at',
 'minute',
 'month',
 'retweet_count',
 'screen_name',
 'source',
 'text',
 'tweet_id',
 'tweet_id_str',
 'year']

In [17]:
# this agrees with what we expected
len(common_columns)

14

## Which ones are missing in the older data?

In [18]:
list(set(df_AH_2020_06_24.columns).difference(set(df_AH_2020_06_16.columns)))

['friends_count',
 'language',
 'favorite_count',
 'urls',
 'hashtags',
 'followers_count',
 'user_favourites_count']

When concatenating, I'll keep all columns, since `friends_count`, `user_favourites_count`, `favorite_count` and `followers_count` can still be used in EDA restricted to the dates available in the most recent data. 

For the sentiment analysis part, `language` is an important piece of information. In addition, I want to maximize the amount of data available within the COVID-19 period. Therefore, before concatenating I'll add this information to the data retrieved on June 16th.

Because the periods covered by the 2 dataframes overlap, I'll first keep only the days in the data retrieved on June 16th that are not present in the most recent data, retrieved on June 24th. This will save us processing time.

In [19]:
# Keeping only days not in the most recent data
df_AH_2020_06_16 = df_AH_2020_06_16[df_AH_2020_06_16['created_at'] <= min(df_AH_2020_06_24['created_at'])]

In [20]:
df_AH_2020_06_16.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 256 entries, 0 to 255
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   mined_at       256 non-null    object             
 1   screen_name    256 non-null    object             
 2   tweet_id       256 non-null    int64              
 3   tweet_id_str   256 non-null    int64              
 4   created_at     256 non-null    datetime64[ns, UTC]
 5   year           256 non-null    int64              
 6   month          256 non-null    int64              
 7   day            256 non-null    int64              
 8   day_of_week    256 non-null    int64              
 9   hour           256 non-null    int64              
 10  minute         256 non-null    int64              
 11  retweet_count  256 non-null    int64              
 12  source         256 non-null    object             
 13  text           256 non-null    object             

Therefore, we have 256 new data points from 30th March until April 8th.

In [21]:
min(df_AH_2020_06_16['created_at']),max(df_AH_2020_06_16['created_at'])

(Timestamp('2020-03-30 08:04:02+0000', tz='UTC'),
 Timestamp('2020-04-08 08:47:29+0000', tz='UTC'))

In [22]:
min(df_AH_2020_06_24['created_at'])

Timestamp('2020-04-08 08:47:29+0000', tz='UTC')
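
Note that the filter above uses `<=`, so the Tweet posted exactly at the first timestamp of the newer file stays in the older data as well. A minimal sketch with plain `datetime` objects shows the difference between the inclusive and the strict comparison. The three timestamps are taken from the outputs in this notebook; that the third one also occurs in the older file is an assumption made for illustration.

```python
from datetime import datetime, timezone

# First timestamp of the most recent file, as printed above.
cutoff = datetime(2020, 4, 8, 8, 47, 29, tzinfo=timezone.utc)

# Illustrative timestamps standing in for rows of the older file.
older = [
    datetime(2020, 3, 30, 8, 4, 2, tzinfo=timezone.utc),
    datetime(2020, 4, 8, 8, 47, 29, tzinfo=timezone.utc),  # boundary Tweet
    datetime(2020, 4, 8, 8, 50, 43, tzinfo=timezone.utc),  # assumed overlap
]

inclusive = [t for t in older if t <= cutoff]  # keeps the boundary Tweet
exclusive = [t for t in older if t < cutoff]   # drops it

print(len(inclusive), len(exclusive))
```

Either choice works as long as duplicates are removed after concatenation; the strict comparison simply avoids creating the duplicate in the first place.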

# Insert language Information to Older Data (retrieved on 16th June 2020)

There are some ways we can extract language information from the text:

1. [detect_language() from TextBlob()](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob.detect_language). It uses a Google API and requires Internet access.

I tried it, but it could not evaluate all 205 entries; the request failed with `HTTPError: HTTP Error 429: Too Many Requests`.

2. [langdetect Python library](https://pypi.org/project/langdetect/): Port of [Google’s language-detection library](https://code.google.com/archive/p/language-detection/) to Python.

3. [LangID](https://pypi.org/project/langid/1.1dev/): Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-561. Available from http://www.aclweb.org/anthology/I11-1062 


4. [TextCat](https://www.nltk.org/_modules/nltk/classify/textcat.html): Cavnar, W. B. and J. M. Trenkle, "[N-Gram-Based Text Categorization](http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf)", In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.


However, we need to keep in mind that automatic language identifiers are very error prone, especially on very short texts. Let's try all of them and see what happens. 
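
To get a feeling for how little text a customer-service reply can contain, the sketch below strips the parts of a Tweet that carry no language signal (the @mention, hashtags and the agent signature such as `^Robin`) and counts the words that remain. This is not part of the original pipeline: the helper names and the threshold of 3 words are assumptions chosen for illustration.

```python
import re

MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
SIGNATURE = re.compile(r"\^\w+\s*$")  # agent signature at the end, e.g. "^Robin"


def linguistic_content(text):
    """Return the words left after removing mentions, hashtags and signature.

    Tokens without any letter (for example a lone emoji) are dropped too.
    """
    text = MENTION.sub(" ", text)
    text = HASHTAG.sub(" ", text)
    text = SIGNATURE.sub(" ", text)
    words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
    return " ".join(words)


def is_too_short(text, min_words=3):
    """Flag texts with fewer than min_words usable words (arbitrary threshold)."""
    return len(linguistic_content(text).split()) < min_words


print(repr(linguistic_content("@MonitorNL 👍 ^Robin")))
print(repr(linguistic_content("@IvonneS38 Bedankt! ^Nurbanu")))
# Shortened version of the English reply that appears in the data.
print(is_too_short("@Bayon_Silvia Hi Silvia, I'm so sorry! ^Zoë"))
```

Texts flagged this way are good candidates for a manual check, since any detector has very little to work with.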


## Evaluating Some Language Detectors

### Detecting language using detect_language from TextBlob

In [23]:
def detect_lang_textblob(text):
    """ Detect language of text using TextBlob, which uses a Google API and requires an internet connection"""
    
    from textblob import TextBlob
    
    b = TextBlob(text)
     
    return b.detect_language()

In [24]:
df_AH_2020_06_16['lang_tb'] = df_AH_2020_06_16['text'].apply(lambda x: detect_lang_textblob(x))

HTTPError: HTTP Error 429: Too Many Requests

TextBlob stopped with `HTTPError: HTTP Error 429: Too Many Requests`, so no `lang_tb` column was created.
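
A 429 response means the service is rate limiting the requests. One common way to cope with that is to retry with an increasing pause between attempts. The sketch below is a generic helper, not something TextBlob provides: `call_with_backoff`, `RateLimited` and `flaky_detect` are hypothetical names, and the fake detector only simulates two rate-limit failures so the example runs offline.

```python
import time


def call_with_backoff(func, *args, retries=4, base_delay=1.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(*args); after a failure wait base_delay * 2**attempt and retry."""
    for attempt in range(retries):
        try:
            return func(*args)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


class RateLimited(Exception):
    """Stand-in for the HTTP 429 error (demo only)."""


calls = {"n": 0}


def flaky_detect(text):
    """Fake detector that is rate limited on its first two calls."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited("429 Too Many Requests")
    return "nl"


waits = []  # record the pauses instead of really sleeping
result = call_with_backoff(flaky_detect, "Bedankt!", base_delay=0.5,
                           sleep=waits.append, retry_on=(RateLimited,))
print(result, waits, calls["n"])
```

With the real detector one would presumably pass the HTTP error class in `retry_on` (the message format suggests `urllib.error.HTTPError`, but that is an assumption). A backoff only helps against short-term throttling; it cannot lift a hard daily quota.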

### Detecting language using `langdetect`

In [25]:
def detect_lang_langdetect(text):
    """ Detect language of text using langdetect"""
    
    from langdetect import detect
     
    return detect(text)

In [26]:
df_AH_2020_06_16['lang_ld'] = df_AH_2020_06_16['text'].apply(lambda x: detect_lang_langdetect(x))

In [27]:
df_AH_2020_06_16['lang_ld'].value_counts()

nl    245
af      4
sw      1
fr      1
tr      1
it      1
en      1
sl      1
de      1
Name: lang_ld, dtype: int64

### Detecting language using `LangID`

In [28]:
def detect_lang_langid(text):
    """ Detect language of text using langid"""
    
    import langid
     
    return langid.classify(text)[0]

In [29]:
df_AH_2020_06_16['lang_lid'] = df_AH_2020_06_16['text'].apply(lambda x: detect_lang_langid(x))

In [30]:
df_AH_2020_06_16['lang_lid'].value_counts()

nl    244
de      2
af      1
ms      1
no      1
am      1
fr      1
en      1
tr      1
lb      1
he      1
mt      1
Name: lang_lid, dtype: int64

### Detecting language using `TextCat`

In [31]:
def detect_lang_textcat(text):
    """ Detect language of text using TextCat"""
    
    from nltk.classify.textcat import TextCat
    
    tc = TextCat()
     
    return tc.guess_language(text)

In [32]:
df_AH_2020_06_16['lang_tc'] = df_AH_2020_06_16['text'].apply(lambda x: detect_lang_textcat(x))

In [33]:
df_AH_2020_06_16['lang_tc'].value_counts()

sun     119
eng      31
fri      30
afr      26
nds      16
vls      15
nld       7
ces       3
deu       3
dan       1
swe       1
fra       1
sot       1
eng       1
pol       1
Name: lang_tc, dtype: int64

Well, I have the feeling that TextCat didn't do so well 🤨 . Let's check the results from `langdetect` and `LangID` that were different from `nl`.

In [34]:
# to see the complete column 'text' (None means no truncation; -1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)

  


In [35]:
df_AH_2020_06_16[['lang_ld','lang_lid','text']][(df_AH_2020_06_16['lang_ld']!='nl')|(df_AH_2020_06_16['lang_lid']!='nl')]

Unnamed: 0,lang_ld,lang_lid,text
5,nl,no,"@Bryan65165100 Werkze, collega!💙 #wedoenhetsamen ^Job"
41,it,am,@MonitorNL 👍 ^Robin
109,sw,fr,@MataKalikamba Dank! 😊 ^Stéphanie
125,af,nl,@LAvanNuil Wat jammer! Om welk filiaal gaat dit? ^Ivie
145,en,en,"@Bayon_Silvia Hi Silvia, I'm so sorry! I'm afraid I cannot do anything about this, but if you send me the email address of your account in a private message, I will take a closer look at it. ^Zoë"
197,nl,lb,"@huhesas Hmm, smakelijk eten alvast 😊 ^Stéphanie"
199,af,nl,@babbelaar4life Hoi Babbelaar! Zou je mij willen laten weten in welke winkel dit was? Dan breng ik de winkel op de hoogte. ^Yasmine
200,sl,nl,"@malmostoso Hi! Oei, jazeker, ga ik voor je regelen! 😊 ^Tyara"
205,af,af,@myrna_1995 Graag gedaan 😊 ^Stéphanie
230,de,de,@IvonneS38 Bedankt! ^Nurbanu


In [36]:
len(df_AH_2020_06_16[['lang_ld','lang_lid','text']][(df_AH_2020_06_16['lang_ld']!='nl')|(df_AH_2020_06_16['lang_lid']!='nl')])

15

For the 15 texts above, at least one of the two language detectors did not label the text as Dutch. In some cases it is difficult to judge the language at all, for example `@martranslations 💙 ^Robin`.

Both correctly agree that `@Bayon_Silvia Hi Silvia, I'm so sorry! I'm afraid I cannot do anything about this, but if you send me the email address of your account in a private message, I will take a closer look at it. ^Zoë` is English.

In general, I'd say that all other cases are indeed Dutch. 

Which detector was better? It is difficult to judge, since the difference was 1 (out of 256 texts, `langdetect` labelled 245 as Dutch and `LangID` 244).

In a bigger dataset, where checking by hand as we did here is not really an option, we could use more than one language detector and apply majority voting to try to reduce the error.
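
A minimal sketch of such a majority vote is shown below. It is not used in this notebook; `majority_vote`, `normalize` and the small code mapping are hypothetical helpers. The mapping is needed because `langdetect` and `LangID` return two-letter codes (`nl`) while TextCat returns three-letter codes (`nld`).

```python
from collections import Counter

# Map the three-letter codes seen in the TextCat output to two-letter codes.
ISO3_TO_ISO1 = {"nld": "nl", "eng": "en", "deu": "de", "fra": "fr", "afr": "af"}


def normalize(code):
    """Convert a known three-letter code to two letters; leave others as-is."""
    return ISO3_TO_ISO1.get(code, code)


def majority_vote(labels, fallback=None):
    """Return the label chosen by a strict majority of detectors, else fallback."""
    if not labels:
        return fallback
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else fallback


# Two detectors that agree (row 145 in the table above).
print(majority_vote(["en", "en"]))
# Two detectors that disagree (row 125): no strict majority.
print(majority_vote(["af", "nl"], fallback="undecided"))
# A third vote breaks the tie; the third label here is hypothetical.
print(majority_vote([normalize(c) for c in ["af", "nl", "nld"]]))
```

With only two detectors a strict majority is the same as full agreement, so an odd number of detectors is the more useful setup. Rows that fall back to `undecided` can then be checked by hand.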

Here I'll just keep the results of `lang_ld`, correcting the wrong results manually.

In [37]:
# index of the ones classified as neither nl nor en (the English tweet is already correct)
list_index = list(df_AH_2020_06_16[~df_AH_2020_06_16['lang_ld'].isin(['nl','en'])].index)
list_index


[41, 109, 125, 199, 200, 205, 230, 232, 238, 251]

In [38]:
# correcting 

df_AH_2020_06_16.loc[list_index,'lang_ld']='nl'

# checking
df_AH_2020_06_16['lang_ld'].value_counts()

nl    255
en    1  
Name: lang_ld, dtype: int64

In [39]:
# copy to lang and drop all lang_

df_AH_2020_06_16['language'] = df_AH_2020_06_16['lang_ld']

df_AH_2020_06_16.drop(columns=['lang_ld', 'lang_lid', 'lang_tc'],inplace=True)

In [40]:
df_AH_2020_06_16.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 256 entries, 0 to 255
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   mined_at       256 non-null    object             
 1   screen_name    256 non-null    object             
 2   tweet_id       256 non-null    int64              
 3   tweet_id_str   256 non-null    int64              
 4   created_at     256 non-null    datetime64[ns, UTC]
 5   year           256 non-null    int64              
 6   month          256 non-null    int64              
 7   day            256 non-null    int64              
 8   day_of_week    256 non-null    int64              
 9   hour           256 non-null    int64              
 10  minute         256 non-null    int64              
 11  retweet_count  256 non-null    int64              
 12  source         256 non-null    object             
 13  text           256 non-null    object             

In [41]:
pd.set_option('display.max_colwidth', 50)

In [42]:
df_AH_2020_06_16.head()

Unnamed: 0,mined_at,screen_name,tweet_id,tweet_id_str,created_at,year,month,day,day_of_week,hour,minute,retweet_count,source,text,language
0,2020-06-16 21:58:04.280117,albertheijn,1244535843135672326,1244535843135672326,2020-03-30 08:04:02+00:00,2020,3,30,0,8,4,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...","@derots Voorraad is er genoeg, het is voor ons...",nl
1,2020-06-16 21:58:04.280117,albertheijn,1244538454890987523,1244538454890987523,2020-03-30 08:14:24+00:00,2020,3,30,0,8,14,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@MoniquevDBurgh We doen er alles aan om zoveel...,nl
2,2020-06-16 21:58:04.280117,albertheijn,1244540668225126401,1244540668225126401,2020-03-30 08:23:12+00:00,2020,3,30,0,8,23,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@AnnekeVisser15 Klopt! De Persoonlijke Bonus w...,nl
3,2020-06-16 21:58:04.012317,albertheijn,1244541424588251141,1244541424588251141,2020-03-30 08:26:12+00:00,2020,3,30,0,8,26,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@waltervantiel We kopen groenten en fruit z...,nl
4,2020-06-16 21:58:04.012317,albertheijn,1244542564344238083,1244542564344238083,2020-03-30 08:30:44+00:00,2020,3,30,0,8,30,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@bbstring Je kunt ons het beste een privéberic...,nl


# Concatenating Older and Most Recent Data

Now we are ready to concatenate all AH data.

In [43]:
min(df_AH_2020_06_24['created_at']),max(df_AH_2020_06_24['created_at'])

(Timestamp('2020-04-08 08:47:29+0000', tz='UTC'),
 Timestamp('2020-06-24 15:06:54+0000', tz='UTC'))

In [44]:
# concatenate dataframes
df_AH_concat = pd.concat([df_AH_2020_06_16,df_AH_2020_06_24])

In [45]:
duplicateRowsDF = df_AH_concat[df_AH_concat.duplicated(['text'])]
duplicateRowsDF

Unnamed: 0,mined_at,screen_name,tweet_id,tweet_id_str,created_at,year,month,day,day_of_week,hour,...,retweet_count,source,text,language,favorite_count,hashtags,urls,user_favourites_count,followers_count,friends_count
0,2020-06-24 17:10:43.175060,albertheijn,1247808270837800961,1247808270837800961,2020-04-08 08:47:29+00:00,2020,4,8,2,8,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@SpectrumRebel Toch komen er momentjes bij. Ho...,nl,,[],[],580.0,45540.0,6.0


As expected, the tweet from '2020-04-08 08:47:29+00:00' is duplicated. It will be removed by using `drop_duplicates`.

In [46]:
df_AH_concat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3483 entries, 0 to 3226
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   mined_at               3483 non-null   object             
 1   screen_name            3483 non-null   object             
 2   tweet_id               3483 non-null   int64              
 3   tweet_id_str           3483 non-null   int64              
 4   created_at             3483 non-null   datetime64[ns, UTC]
 5   year                   3483 non-null   int64              
 6   month                  3483 non-null   int64              
 7   day                    3483 non-null   int64              
 8   day_of_week            3483 non-null   int64              
 9   hour                   3483 non-null   int64              
 10  minute                 3483 non-null   int64              
 11  retweet_count          3483 non-null   int64            

In [47]:
# eliminate duplicates based on text; keep is set to 'last' since the older data has nan in the columns
# that only exist in the most recent data, and it is better to keep the row that is not nan

df_AH_concat.drop_duplicates(subset=['text'], inplace = True, keep = 'last')

# sorting by 'created_at'
df_AH_concat.sort_values(by='created_at',inplace = True)

# reset index
df_AH_concat.reset_index(drop = True, inplace = True)

# save in csv

df_AH_concat.to_csv("../data/processed/AH_concat_16_and_24_June_"+TodaysDate+".csv", index = False)
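The effect of `keep='last'` can be illustrated on a toy example. This is only a minimal sketch: the two small frames and their values are made up, and it assumes the older pull is concatenated before the newer one, so that the last copy of a duplicated Tweet is the one with more columns filled in (in the `head()` output below, `followers_count` is NaN for rows mined on 16 June).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the same Tweet text appears in an older pull,
# where followers_count is missing, and in a newer pull, where it is filled in.
older = pd.DataFrame({
    "created_at": ["2020-04-08 08:47:29+00:00"],
    "text": ["same tweet"],
    "followers_count": [np.nan],
})
newer = pd.DataFrame({
    "created_at": ["2020-04-08 08:47:29+00:00", "2020-04-09 10:00:00+00:00"],
    "text": ["same tweet", "another tweet"],
    "followers_count": [45540.0, 45540.0],
})

# Assumption: older rows come first, newer rows last.
combined = pd.concat([older, newer])

# keep='last' retains the newer copy of the duplicated text,
# which is the copy without missing values in this toy example.
deduped = (
    combined.drop_duplicates(subset=["text"], keep="last")
    .sort_values("created_at")
    .reset_index(drop=True)
)
print(deduped)
```

With `keep='first'` the row with the missing `followers_count` would have been kept instead, which is why the order of concatenation matters here.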

In [48]:
df_AH_concat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3482 entries, 0 to 3481
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   mined_at               3482 non-null   object             
 1   screen_name            3482 non-null   object             
 2   tweet_id               3482 non-null   int64              
 3   tweet_id_str           3482 non-null   int64              
 4   created_at             3482 non-null   datetime64[ns, UTC]
 5   year                   3482 non-null   int64              
 6   month                  3482 non-null   int64              
 7   day                    3482 non-null   int64              
 8   day_of_week            3482 non-null   int64              
 9   hour                   3482 non-null   int64              
 10  minute                 3482 non-null   int64              
 11  retweet_count          3482 non-null   int64            

In [49]:
df_AH_concat.head()

Unnamed: 0,mined_at,screen_name,tweet_id,tweet_id_str,created_at,year,month,day,day_of_week,hour,...,retweet_count,source,text,language,favorite_count,hashtags,urls,user_favourites_count,followers_count,friends_count
0,2020-06-16 21:58:04.280117,albertheijn,1244535843135672326,1244535843135672326,2020-03-30 08:04:02+00:00,2020,3,30,0,8,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...","@derots Voorraad is er genoeg, het is voor ons...",nl,,,,,,
1,2020-06-16 21:58:04.280117,albertheijn,1244538454890987523,1244538454890987523,2020-03-30 08:14:24+00:00,2020,3,30,0,8,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@MoniquevDBurgh We doen er alles aan om zoveel...,nl,,,,,,
2,2020-06-16 21:58:04.280117,albertheijn,1244540668225126401,1244540668225126401,2020-03-30 08:23:12+00:00,2020,3,30,0,8,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@AnnekeVisser15 Klopt! De Persoonlijke Bonus w...,nl,,,,,,
3,2020-06-16 21:58:04.012317,albertheijn,1244541424588251141,1244541424588251141,2020-03-30 08:26:12+00:00,2020,3,30,0,8,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@waltervantiel We kopen groenten en fruit z...,nl,,,,,,
4,2020-06-16 21:58:04.012317,albertheijn,1244542564344238083,1244542564344238083,2020-03-30 08:30:44+00:00,2020,3,30,0,8,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@bbstring Je kunt ons het beste een privéberic...,nl,,,,,,


In [50]:
df_AH_concat.tail()

Unnamed: 0,mined_at,screen_name,tweet_id,tweet_id_str,created_at,year,month,day,day_of_week,hour,...,retweet_count,source,text,language,favorite_count,hashtags,urls,user_favourites_count,followers_count,friends_count
3477,2020-06-24 17:10:38.375439,albertheijn,1275775482974388224,1275775482974388224,2020-06-24 12:59:12+00:00,2020,6,24,2,12,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@TwiitterLaura Hi! Je mag ook bovenstaande...,nl,,[],[],580.0,45540.0,6.0
3478,2020-06-24 17:10:38.375439,albertheijn,1275779436185600002,1275779436185600002,2020-06-24 13:14:55+00:00,2020,6,24,2,13,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...","@aartnieuwland Wanneer je plakjes kaas koopt, ...",nl,,[],[],580.0,45540.0,6.0
3479,2020-06-24 17:10:38.375439,albertheijn,1275799009026871297,1275799009026871297,2020-06-24 14:32:41+00:00,2020,6,24,2,14,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@zomaareendame Oh jee! Dat kan bij mij. Als je...,nl,,[],[],580.0,45540.0,6.0
3480,2020-06-24 17:10:38.375439,albertheijn,1275800040687239170,1275800040687239170,2020-06-24 14:36:47+00:00,2020,6,24,2,14,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...",@marzman95 Hi! Je kan deze zegels combineren. ...,nl,,[],[],580.0,45540.0,6.0
3481,2020-06-24 17:10:38.375439,albertheijn,1275807617395372034,1275807617395372034,2020-06-24 15:06:54+00:00,2020,6,24,2,15,...,0,"<a href=""https://www.tracebuzz.com"" rel=""nofol...","@Tiezzymeister Hi, ik heb even voor je gekeken...",nl,,[],[],580.0,45540.0,6.0


In [52]:
df_test = pd.read_csv("../data/processed/AH_concat_16_and_24_June_"+TodaysDate+".csv")

In [53]:
df_test.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3482 entries, 0 to 3481
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   mined_at               3482 non-null   object 
 1   screen_name            3482 non-null   object 
 2   tweet_id               3482 non-null   int64  
 3   tweet_id_str           3482 non-null   int64  
 4   created_at             3482 non-null   object 
 5   year                   3482 non-null   int64  
 6   month                  3482 non-null   int64  
 7   day                    3482 non-null   int64  
 8   day_of_week            3482 non-null   int64  
 9   hour                   3482 non-null   int64  
 10  minute                 3482 non-null   int64  
 11  retweet_count          3482 non-null   int64  
 12  source                 3482 non-null   object 
 13  text                   3482 non-null   object 
 14  language               3482 non-null   object 
 15  favo

# Processing Twitter Search Data

Similar to what was done with the user timeline Tweets, we first check the data collected using `GetSearch` and then concatenate the results of each query.

In [54]:
folder = "../data/tweets/"
result = search_file_in_folder(folder, 'query')
result

['../data/tweets\\query_02_2020-06-21-12-17.csv',
 '../data/tweets\\query_02_2020-06-21-12-47.csv',
 '../data/tweets\\query_02_2020-06-21-12-50.csv',
 '../data/tweets\\query_02_2020-06-22-15-15.csv',
 '../data/tweets\\query_02_2020-06-24-17-15.csv',
 '../data/tweets\\query_03_2020-06-21-12-23.csv',
 '../data/tweets\\query_03_2020-06-21-12-35.csv',
 '../data/tweets\\query_03_2020-06-21-12-48.csv',
 '../data/tweets\\query_03_2020-06-21-12-52.csv',
 '../data/tweets\\query_03_2020-06-21-12-59.csv',
 '../data/tweets\\query_03_2020-06-22-15-23.csv',
 '../data/tweets\\query_03_2020-06-24-17-16.csv',
 '../data/tweets\\query_04_2020-06-21-12-37.csv',
 '../data/tweets\\query_04_2020-06-21-13-03.csv',
 '../data/tweets\\query_04_2020-06-22-15-28.csv',
 '../data/tweets\\query_04_2020-06-24-17-16.csv',
 '../data/tweets\\query_05_2020-06-21-13-05.csv',
 '../data/tweets\\query_05_2020-06-22-15-39.csv',
 '../data/tweets\\query_05_2020-06-24-17-16.csv',
 '../data/tweets\\query_06_2020-06-21-13-07.csv',


In [55]:
df_queries_info = create_dataframe_info(result)
df_queries_info

Unnamed: 0,file_path,min_created_list,max_created_list,n_tweet_list,n_columns
0,../data/tweets\query_02_2020-06-21-12-17.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,80,23
1,../data/tweets\query_02_2020-06-21-12-47.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,80,23
2,../data/tweets\query_02_2020-06-21-12-50.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
3,../data/tweets\query_02_2020-06-22-15-15.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
4,../data/tweets\query_02_2020-06-24-17-15.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
5,../data/tweets\query_03_2020-06-21-12-23.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,200,23
6,../data/tweets\query_03_2020-06-21-12-35.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,200,23
7,../data/tweets\query_03_2020-06-21-12-48.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,200,23
8,../data/tweets\query_03_2020-06-21-12-52.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
9,../data/tweets\query_03_2020-06-21-12-59.csv,2020-06-18 11:50:31+00:00,2020-06-18 11:56:42+00:00,2,23


In [56]:
df_queries_info.sort_values('n_tweet_list')

Unnamed: 0,file_path,min_created_list,max_created_list,n_tweet_list,n_columns
10,../data/tweets\query_03_2020-06-22-15-23.csv,2020-06-18 11:50:31+00:00,2020-06-18 11:56:42+00:00,2,23
17,../data/tweets\query_05_2020-06-22-15-39.csv,2020-06-20 17:25:49+00:00,2020-06-21 09:33:32+00:00,2,23
16,../data/tweets\query_05_2020-06-21-13-05.csv,2020-06-20 17:25:49+00:00,2020-06-21 09:33:32+00:00,2,23
9,../data/tweets\query_03_2020-06-21-12-59.csv,2020-06-18 11:50:31+00:00,2020-06-18 11:56:42+00:00,2,23
11,../data/tweets\query_03_2020-06-24-17-16.csv,2020-06-18 11:50:31+00:00,2020-06-18 11:56:42+00:00,2,23
2,../data/tweets\query_02_2020-06-21-12-50.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
3,../data/tweets\query_02_2020-06-22-15-15.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
4,../data/tweets\query_02_2020-06-24-17-15.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
18,../data/tweets\query_05_2020-06-24-17-16.csv,2020-06-20 17:25:49+00:00,2020-06-24 08:38:38+00:00,4,23
8,../data/tweets\query_03_2020-06-21-12-52.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23


In [57]:
df_test = pd.read_csv("../data/tweets/query_02_2020-06-21-12-47.csv")

In [58]:
df_test.drop_duplicates('text').info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 60
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mined_at                 4 non-null      object 
 1   created_at               4 non-null      object 
 2   year                     4 non-null      int64  
 3   month                    4 non-null      int64  
 4   day                      4 non-null      int64  
 5   day_of_week              4 non-null      int64  
 6   hour                     4 non-null      int64  
 7   minute                   4 non-null      int64  
 8   tweet_id                 4 non-null      int64  
 9   tweet_id_str             4 non-null      int64  
 10  in_reply_to_screen_name  2 non-null      object 
 11  in_reply_to_status_id    2 non-null      float64
 12  in_reply_to_user_id      2 non-null      float64
 13  hashtags                 4 non-null      object 
 14  source                   4 no

In [59]:
df_test = pd.read_csv("../data/tweets/query_04_2020-06-21-13-03.csv")

In [60]:
df_test.drop_duplicates('text').info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 14
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mined_at                 14 non-null     object 
 1   created_at               14 non-null     object 
 2   year                     14 non-null     int64  
 3   month                    14 non-null     int64  
 4   day                      14 non-null     int64  
 5   day_of_week              14 non-null     int64  
 6   hour                     14 non-null     int64  
 7   minute                   14 non-null     int64  
 8   tweet_id                 14 non-null     int64  
 9   tweet_id_str             14 non-null     int64  
 10  in_reply_to_screen_name  2 non-null      object 
 11  in_reply_to_status_id    2 non-null      float64
 12  in_reply_to_user_id      2 non-null      float64
 13  hashtags                 14 non-null     object 
 14  source                   14 

As seen in notebook 1, `GetSearch` usually returns many duplicates, depending on `max_page`. In the first trials these duplicates were not removed, so the files containing 80 or more Tweets consist mostly of duplicates. Once duplicates are dropped, our queries therefore contain only a small number of Tweets.
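The share of duplicates in a file can be quantified directly. The sketch below is not part of the original workflow: `duplicate_summary` is a hypothetical helper, and the toy frame only mimics a file with 80 rows and 4 distinct texts, like the one inspected above.

```python
import pandas as pd

def duplicate_summary(df, key="text"):
    """Return total rows, unique rows and the share of duplicated rows by `key`."""
    n_total = len(df)
    n_unique = df[key].nunique(dropna=False)
    share = 0.0 if n_total == 0 else 1 - n_unique / n_total
    return {"n_total": n_total, "n_unique": n_unique, "duplicate_share": share}

# Toy frame (hypothetical): 4 distinct texts repeated 20 times each, 80 rows in total
toy = pd.DataFrame({"text": ["a", "b", "c", "d"] * 20})
summary = duplicate_summary(toy)
print(summary)
```

Applied to each file in `result`, a helper like this would make it easy to see which files add new Tweets and which only repeat earlier ones.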

In [61]:
df_queries_info.sort_values('file_path')

Unnamed: 0,file_path,min_created_list,max_created_list,n_tweet_list,n_columns
0,../data/tweets\query_02_2020-06-21-12-17.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,80,23
1,../data/tweets\query_02_2020-06-21-12-47.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,80,23
2,../data/tweets\query_02_2020-06-21-12-50.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
3,../data/tweets\query_02_2020-06-22-15-15.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
4,../data/tweets\query_02_2020-06-24-17-15.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
5,../data/tweets\query_03_2020-06-21-12-23.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,200,23
6,../data/tweets\query_03_2020-06-21-12-35.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,200,23
7,../data/tweets\query_03_2020-06-21-12-48.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,200,23
8,../data/tweets\query_03_2020-06-21-12-52.csv,2020-06-18 11:50:31+00:00,2020-06-18 15:03:16+00:00,4,23
9,../data/tweets\query_03_2020-06-21-12-59.csv,2020-06-18 11:50:31+00:00,2020-06-18 11:56:42+00:00,2,23


In [62]:
list_files = df_queries_info.sort_values('file_path')['file_path'].tolist()
list_files

['../data/tweets\\query_02_2020-06-21-12-17.csv',
 '../data/tweets\\query_02_2020-06-21-12-47.csv',
 '../data/tweets\\query_02_2020-06-21-12-50.csv',
 '../data/tweets\\query_02_2020-06-22-15-15.csv',
 '../data/tweets\\query_02_2020-06-24-17-15.csv',
 '../data/tweets\\query_03_2020-06-21-12-23.csv',
 '../data/tweets\\query_03_2020-06-21-12-35.csv',
 '../data/tweets\\query_03_2020-06-21-12-48.csv',
 '../data/tweets\\query_03_2020-06-21-12-52.csv',
 '../data/tweets\\query_03_2020-06-21-12-59.csv',
 '../data/tweets\\query_03_2020-06-22-15-23.csv',
 '../data/tweets\\query_03_2020-06-24-17-16.csv',
 '../data/tweets\\query_04_2020-06-21-12-37.csv',
 '../data/tweets\\query_04_2020-06-21-13-03.csv',
 '../data/tweets\\query_04_2020-06-22-15-28.csv',
 '../data/tweets\\query_04_2020-06-24-17-16.csv',
 '../data/tweets\\query_05_2020-06-21-13-05.csv',
 '../data/tweets\\query_05_2020-06-22-15-39.csv',
 '../data/tweets\\query_05_2020-06-24-17-16.csv',
 '../data/tweets\\query_06_2020-06-21-13-07.csv',


In [63]:
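def concat_queries(list_files):
    """Read the given CSV files, stack them and drop Tweets with repeated text."""
    # read every file into its own DataFrame
    df_list = []
    for file in list_files:
        df_list.append(pd.read_csv(file))

    # stack all results, keep the first occurrence of each text, rebuild the index
    df_concat = pd.concat(df_list)
    df_concat.drop_duplicates('text',inplace=True)
    df_concat.reset_index(inplace=True, drop=True)

    return df_concat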
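The cells that follow select the files of each query with hard-coded slices such as `list_files[:5]`. As a possible alternative, the files could be grouped by the query id embedded in their names. This is only a sketch: `group_files_by_query` is a hypothetical helper, and it assumes every file name contains a `query_NN_` prefix as in the listing above.

```python
import re
from collections import defaultdict

def group_files_by_query(paths):
    """Map a query id such as 'query_02' to the list of files that belong to it."""
    groups = defaultdict(list)
    for path in paths:
        match = re.search(r"(query_\d+)_", path)
        if match:  # files that do not follow the naming pattern are skipped
            groups[match.group(1)].append(path)
    return dict(groups)

# A few paths copied from the listing above
example_paths = [
    "../data/tweets\\query_02_2020-06-21-12-17.csv",
    "../data/tweets\\query_02_2020-06-21-12-47.csv",
    "../data/tweets\\query_03_2020-06-21-12-23.csv",
]
grouped = group_files_by_query(example_paths)
print(grouped)
```

Each group could then be passed to `concat_queries`, for example `concat_queries(grouped["query_02"])`, so the slices would not need to be updated when new files are collected.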

In [64]:
list_files[16:19]

['../data/tweets\\query_05_2020-06-21-13-05.csv',
 '../data/tweets\\query_05_2020-06-22-15-39.csv',
 '../data/tweets\\query_05_2020-06-24-17-16.csv']

In [65]:
df_concat_query_02 = concat_queries(list_files[:5])
df_concat_query_02

Unnamed: 0,mined_at,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,created_at,...,screen_name,user_tweet_id,user_tweet_id_str,user_favourites_count,followers_count,friends_count,text,user_screen_name,user_id,user_location
0,2020-06-21 12:17:07.655257,1273583869753724930,1273583869753724930,jelleprins,1.273583e+18,16546619.0,[],"<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,2020-06-18 11:50:31+00:00,...,GHengeveld,20709019.0,20709019.0,889.0,1484,1750,@jelleprins @picnic @JumboSupermarkt @alberthe...,,,
1,2020-06-21 12:17:09.519252,1273585426268409859,1273585426268409859,Oli4K,1.273584e+18,6091182.0,[],"<a href=""http://twitter.com/download/iphone"" r...",nl,2020-06-18 11:56:42+00:00,...,jelleprins,16546619.0,16546619.0,4416.0,10513,1063,@Oli4K @picnic @JumboSupermarkt @albertheijn L...,,,
2,2020-06-21 12:17:08.403804,1273616458459705347,1273616458459705347,,,,[],"<a href=""https://about.twitter.com/products/tw...",nl,2020-06-18 14:00:01+00:00,...,agfnl,164631799.0,164631799.0,84.0,8530,2128,"""Trendbreuk: Jumbo al 3 weken duurder in AGF d...",,,
3,2020-06-21 12:17:09.120192,1273632376950849537,1273632376950849537,,,,[],"<a href=""http://twitter.com/download/android"" ...",nl,2020-06-18 15:03:16+00:00,...,martiendas,111582731.0,111582731.0,10982.0,4649,5115,Half juni en nog steeds worden boodschappen be...,,,


In [66]:
df_concat_query_02.to_csv("../data/processed/query_02_"+TodaysDate+".csv", index = False)

In [67]:
df_concat_query_03 = concat_queries(list_files[5:12])
df_concat_query_03

Unnamed: 0,mined_at,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,created_at,...,screen_name,user_tweet_id,user_tweet_id_str,user_favourites_count,followers_count,friends_count,text,user_screen_name,user_id,user_location
0,2020-06-21 12:21:48.502231,1273583869753724930,1273583869753724930,jelleprins,1.273583e+18,16546619.0,[],"<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,2020-06-18 11:50:31+00:00,...,GHengeveld,20709019.0,20709019.0,889.0,1484,1750,@jelleprins @picnic @JumboSupermarkt @alberthe...,,,
1,2020-06-21 12:21:49.408779,1273585426268409859,1273585426268409859,Oli4K,1.273584e+18,6091182.0,[],"<a href=""http://twitter.com/download/iphone"" r...",nl,2020-06-18 11:56:42+00:00,...,jelleprins,16546619.0,16546619.0,4416.0,10513,1063,@Oli4K @picnic @JumboSupermarkt @albertheijn L...,,,
2,2020-06-21 12:21:44.806809,1273616458459705347,1273616458459705347,,,,[],"<a href=""https://about.twitter.com/products/tw...",nl,2020-06-18 14:00:01+00:00,...,agfnl,164631799.0,164631799.0,84.0,8530,2128,"""Trendbreuk: Jumbo al 3 weken duurder in AGF d...",,,
3,2020-06-21 12:21:51.659489,1273632376950849537,1273632376950849537,,,,[],"<a href=""http://twitter.com/download/android"" ...",nl,2020-06-18 15:03:16+00:00,...,martiendas,111582731.0,111582731.0,10982.0,4649,5115,Half juni en nog steeds worden boodschappen be...,,,


In [68]:
df_concat_query_03.to_csv("../data/processed/query_03_"+TodaysDate+".csv", index = False)

In [69]:
df_concat_query_04 = concat_queries(list_files[12:16])
df_concat_query_04

Unnamed: 0,mined_at,tweet_id,tweet_id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,hashtags,source,language,created_at,...,screen_name,user_tweet_id,user_tweet_id_str,user_favourites_count,followers_count,friends_count,text,user_screen_name,user_id,user_location
0,2020-06-21 12:36:58.007296,1273583869753724930,1273583869753724930,jelleprins,1.273583e+18,16546620.0,[],"<a href=""https://mobile.twitter.com"" rel=""nofo...",nl,2020-06-18 11:50:31+00:00,...,GHengeveld,20709020.0,20709020.0,889.0,1484,1750,@jelleprins @picnic @JumboSupermarkt @alberthe...,,,
1,2020-06-21 12:37:02.634602,1273585426268409859,1273585426268409859,Oli4K,1.273584e+18,6091182.0,[],"<a href=""http://twitter.com/download/iphone"" r...",nl,2020-06-18 11:56:42+00:00,...,jelleprins,16546620.0,16546620.0,4416.0,10513,1063,@Oli4K @picnic @JumboSupermarkt @albertheijn L...,,,
2,2020-06-21 12:36:54.757866,1273616458459705347,1273616458459705347,,,,[],"<a href=""https://about.twitter.com/products/tw...",nl,2020-06-18 14:00:01+00:00,...,agfnl,164631800.0,164631800.0,84.0,8530,2128,"""Trendbreuk: Jumbo al 3 weken duurder in AGF d...",,,
3,2020-06-21 12:37:01.733670,1273632376950849537,1273632376950849537,,,,[],"<a href=""http://twitter.com/download/android"" ...",nl,2020-06-18 15:03:16+00:00,...,martiendas,111582700.0,111582700.0,10982.0,4648,5115,Half juni en nog steeds worden boodschappen be...,,,
4,2020-06-21 13:02:46.884312,1274501643044884480,1274501643044884480,,,,[],"<a href=""http://www.socastdigital.com"" rel=""no...",en,2020-06-21 00:37:25+00:00,...,SeehaferNews,9.668518e+17,9.668518e+17,110.0,173,93,The Kiel Community Picnic Committee has decide...,,,
5,2020-06-21 13:02:49.298622,1274501644005277697,1274501644005277697,,,,[],"<a href=""http://www.socastdigital.com"" rel=""no...",en,2020-06-21 00:37:25+00:00,...,womtam,58450690.0,58450690.0,154.0,416,98,The Kiel Community Picnic Committee has decide...,,,
6,2020-06-21 13:02:45.561162,1274504489744117762,1274504489744117762,,,,[],"<a href=""http://twitter.com/download/iphone"" r...",en,2020-06-21 00:48:44+00:00,...,PalamaJeni,1.038502e+18,1.038502e+18,1088.0,119,131,RT @TrishaFisher681: If Covid cancels Newton P...,,,
7,2020-06-21 13:02:44.795555,1274507812341760000,1274507812341760000,,,,[],"<a href=""http://twitter.com/download/android"" ...",en,2020-06-21 01:01:56+00:00,...,5455km629,3627509000.0,3627509000.0,86318.0,5643,6105,RT @diannemando: @AngelinaWTSP @HealthyFla @10...,,,
8,2020-06-21 13:02:46.335633,1274515327284477955,1274515327284477955,,,,"[{'text': 'SanDiego'}, {'text': 'socialdistanc...","<a href=""https://mobile.twitter.com"" rel=""nofo...",en,2020-06-21 01:31:48+00:00,...,aliveinsv,309284000.0,309284000.0,1879.0,13,133,"#SanDiego. 3 moms, 5 kids at park, all masked....",,,
9,2020-06-21 13:02:44.534285,1274526059174801409,1274526059174801409,MurielBowser,1.274516e+18,245571900.0,[],"<a href=""http://twitter.com/download/iphone"" r...",en,2020-06-21 02:14:26+00:00,...,Platonas96,1.171823e+18,1.171823e+18,895.0,15,117,@MurielBowser @ktumulty @AOC Yes. Both u and A...,,,


In [70]:
df_concat_query_04.to_csv("../data/processed/query_04_"+TodaysDate+".csv", index = False)

In [71]:
df_concat_query_05 = concat_queries(list_files[16:19])
df_concat_query_05

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,tweet_id,tweet_id_str,...,screen_name,user_tweet_id,user_tweet_id_str,user_favourites_count,followers_count,friends_count,text,user_screen_name,user_id,user_location
0,2020-06-21 13:04:56.400601,2020-06-20 17:25:49+00:00,2020,6,20,5,17,25,1274393028183162881,1274393028183162881,...,milieuzone,3567512000.0,3567512000.0,857.0,171,75,Complimenten aan @JumboSupermarkt #Utrecht #me...,,,
1,2020-06-21 13:04:56.025613,2020-06-21 09:33:32+00:00,2020,6,21,6,9,33,1274636561704005632,1274636561704005632,...,JumboSupermarkt,2797823000.0,2797823000.0,3764.0,16212,1710,@deAZfan De 'Vierde wachtende' regel geldt tij...,,,
2,2020-06-24 17:16:19.427127,2020-06-22 15:14:49+00:00,2020,6,22,0,15,14,1275084834914918407,1275084834914918407,...,,,,858.0,171,75,@UtrechtseSjoerd @GemeenteUtrecht @JumboSuperm...,milieuzone,3567512000.0,"Utrecht, The Netherlands"
3,2020-06-24 17:16:20.207921,2020-06-24 08:38:38+00:00,2020,6,24,2,8,38,1275709910286827520,1275709910286827520,...,,,,,60,126,"Hallo @JumboSupermarkt, ik weet dat iedere cen...",GeeJeeAn,1.088013e+18,


In [72]:
df_concat_query_05.to_csv("../data/processed/query_05_"+TodaysDate+".csv", index = False)

In [73]:
df_concat_query_06 = concat_queries(list_files[19:])
df_concat_query_06

Unnamed: 0,mined_at,created_at,year,month,day,day_of_week,hour,minute,tweet_id,tweet_id_str,...,screen_name,user_tweet_id,user_tweet_id_str,user_favourites_count,followers_count,friends_count,text,user_screen_name,user_id,user_location
0,2020-06-21 13:07:45.190107,2020-06-14 18:37:37+00:00,2020,6,14,6,18,37,1272236769145208833,1272236769145208833,...,vai3333,207973600.0,207973600.0,1999.0,7183,5297,RT @AnimalStill: There comes a moment you find...,,,
1,2020-06-21 13:07:47.134882,2020-06-18 09:59:31+00:00,2020,6,18,3,9,59,1273555936356061186,1273555936356061186,...,ChristienJanson,40452840.0,40452840.0,5630.0,592,1099,"Goh @albertheijn Almere Lavendelplantsoen, gee...",,,
2,2020-06-21 13:07:43.559885,2020-06-18 17:10:57+00:00,2020,6,18,3,17,10,1273664510591733762,1273664510591733762,...,buisman_pro,7.564324e+17,7.564324e+17,133.0,165,1320,#Retail #Innovatie #Covid_19 😷 #AlbertHeijn #C...,,,
3,2020-06-21 13:07:44.989645,2020-06-19 20:43:35+00:00,2020,6,19,4,20,43,1274080408028753920,1274080408028753920,...,wendersinke,381493200.0,381493200.0,2352.0,606,519,"Iedereen heeft wel iets te zeggen, maar soms n...",,,
4,2020-06-21 13:07:43.985991,2020-06-20 06:14:39+00:00,2020,6,20,5,6,14,1274224124836089857,1274224124836089857,...,JeannetteWezen1,1.235878e+18,1.235878e+18,7170.0,175,164,@marutza_mh @albertheijn #Jumbo is selling bre...,,,
5,2020-06-21 13:07:46.918083,2020-06-20 17:25:49+00:00,2020,6,20,5,17,25,1274393028183162881,1274393028183162881,...,milieuzone,3567512000.0,3567512000.0,857.0,171,75,Complimenten aan @JumboSupermarkt #Utrecht #me...,,,
6,2020-06-22 15:40:20.390112,2020-06-22 10:17:30+00:00,2020,6,22,0,10,17,1275010013401165825,1275010013401165825,...,Roxann_Minerals,7.260899e+17,7.260899e+17,,1818,4999,Do Masks Really Protect Against Covid? https:/...,,,
7,2020-06-24 17:16:24.365929,2020-06-22 15:14:49+00:00,2020,6,22,0,15,14,1275084834914918407,1275084834914918407,...,,,,858.0,171,75,@UtrechtseSjoerd @GemeenteUtrecht @JumboSuperm...,milieuzone,3567512000.0,"Utrecht, The Netherlands"
8,2020-06-24 17:16:24.809824,2020-06-24 11:39:37+00:00,2020,6,24,2,11,39,1275755454648520704,1275755454648520704,...,,,,,42,100,Nog een avonddienst en dan 2 weken vakantie. B...,AnjaFijnaut,212649100.0,


In [74]:
df_concat_query_06.to_csv("../data/processed/query_06_"+TodaysDate+".csv", index = False)

# Conclusions

In this notebook we prepared Tweet data from user timeline and queries for analysis (EDA and Sentiment Analysis).

The main goals were:

* Increase the date range of the user timeline Tweet data by combining Tweets collected on different dates.
* Combine all Tweet search results by query.

Since we needed to infer the language of Tweets collected on June 16th, we used this opportunity to evaluate some language detectors. As a result, the following was observed:

1. `detect_language()` from `TextBlob` raised an error, so it does not seem to be a good choice when language must be detected for a sizeable amount of text.

2. The `langdetect` Python library was correct in 246 of 256 cases (96.1%).

3. `langID` was correct in 245 of 256 cases (95.7%).

4. `TextCat` ran slower than the others and gave very poor results.

Of the language detectors tested, we chose `langdetect` and corrected its wrong results manually. We believe that, for larger datasets, using multiple language detectors and applying a majority vote could reduce the error. However, we need to keep in mind that automatic language identifiers are very error prone, especially on very short texts.
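The majority-vote idea was not implemented in this notebook; a minimal sketch of how it might look is given here. The functions `majority_vote` and `accuracy` are hypothetical, and the predictions are passed in as plain lists of language codes so that the example does not depend on any particular detector library.

```python
from collections import Counter

def majority_vote(predictions, fallback=None):
    """Return the label predicted by most detectors; `fallback` on a tie or empty input."""
    counts = Counter(p for p in predictions if p is not None)
    if not counts:
        return fallback
    ranked = counts.most_common(2)
    # a tie between the two most frequent labels means there is no clear majority
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]

def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Hypothetical predictions of three detectors for two texts
votes_per_text = [["nl", "nl", "en"], ["en", "en", "en"]]
voted = [majority_vote(v, fallback="unknown") for v in votes_per_text]
print(voted, accuracy(voted, ["nl", "en"]))
```

Returning a fallback on ties keeps ambiguous cases visible, so they can still be checked manually, as was done here for the wrong `langdetect` results.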

In the following notebook we will perform some EDA and Sentiment Analysis on the data we've just prepared.