# Data Mining

### RUN Only with COLAB

This cell will setup notebook for running on Google Colab platform.

In [1]:
#!git clone https://FedericoSilvestri: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@github.com/federicosilvestri/data-mining.git

In [5]:
#%cd data-mining

In [1]:
import json
import math
import re
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

from collections import defaultdict
from scipy.stats.stats import pearsonr

import sys
import logging as lg

root = lg.getLogger()
root.setLevel(lg.INFO)

handler = lg.StreamHandler(sys.stdout)
handler.setLevel(lg.DEBUG)
formatter = lg.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
root.addHandler(handler)

  from scipy.stats.stats import pearsonr


## Dataset

Fetching the dataset using our native python functions.

In [2]:
from utils import fetch_dataset

dataset = fetch_dataset()

2022-10-31 14:30:57,029 - root - INFO - Pandas reading dataset tweets.csv...
2022-10-31 14:31:50,550 - root - INFO - Pandas reading dataset users.csv...


# TASK 1.1

Exploring the dataset with analytical tool.

## Overview

### Users

Show `users.csv` information: types of data and columns.

In [3]:
users = dataset['users.csv'].copy() # make a copy

users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11508 entries, 0 to 11507
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              11508 non-null  int64  
 1   name            11507 non-null  object 
 2   lang            11508 non-null  object 
 3   bot             11508 non-null  int64  
 4   created_at      11508 non-null  object 
 5   statuses_count  11109 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 539.6+ KB


In [4]:
# Display lang values
users['lang'].value_counts()

en                    9970
it                     906
es                     319
pt                      65
en-gb                   50
ru                      42
fr                      36
ja                      33
zh-tw                   17
tr                      14
id                      12
ko                       9
de                       8
nl                       6
en-GB                    4
ar                       3
zh-TW                    3
da                       2
Select Language...       2
en-AU                    1
zh-cn                    1
pl                       1
el                       1
fil                      1
sv                       1
xx-lc                    1
Name: lang, dtype: int64

As we can see, we have:

1. `xx-lc`
2. `Select Language...`

That are not a valid language.
We have decided to use iso639-1 Python library to detect valid languages.

In [5]:
# Display BOT values
# 0 -> it's a human!
# 1 -> it's a bot!
users['bot'].value_counts()

1    6116
0    5392
Name: bot, dtype: int64

As we can see we have clean data for `bot` column.

### Tweets

Show `tweets.csv information: types of data and columns

In [4]:
tweets = dataset['tweets.csv'].copy()

tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13664696 entries, 0 to 13664695
Data columns (total 10 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   id              object
 1   user_id         object
 2   retweet_count   object
 3   reply_count     object
 4   favorite_count  object
 5   num_hashtags    object
 6   num_urls        object
 7   num_mentions    object
 8   created_at      object
 9   text            object
dtypes: object(10)
memory usage: 1.0+ GB


## Data understanding

In these cells we are going to understand and clean the data of two datasets.
The analysis performs:

1. Replacement of null values with median if type is numerical, mode if type is categorical and **outlier** timestamp value if type is datetime.
2. Deletion of rows that has a ratio between valid values and invalid values `< k` where `k` is a param with default value 60%.
3. Understand and replace categorical value based on their domain. For example, the language column contains invalid language codes, and we replace them with the mode value.

# TODO prof recap points for DATA UNDERSTANDING
(last slide of data understanding)
Checklist for Data Understanding
- Determine the quality of the data.(e.g.syntactic accuracy)
- Find outliers. (e. g. using visualization techniques)
- Detect and examine missing values. Possible hidden by default values.
- Discover new or confirm expected dependencies or correlations between attributes.
- Check specific application dependent assumptions (e.g. the attribute follows a normal distribution)
- Compare statistics with the expected behaviour.

### Users

In [61]:
# Constant definition for outlier values
min_date = pd.Timestamp('2006-03-21') # the date when Twitter has started the activity.
max_date = pd.Timestamp('2022-09-28') # the date when dataset has been collected.

# OUTLIER constants
OUTLIER_TIMESTAMP = pd.Timestamp('1800-01-01')

In [8]:
def clean_invalid_rows(df, column_validators, ratio=0.6):
    #
    # This function is a generic function, that performs cleaning of rows that are invalid.
    # We define row as invalid if the ratio between valid and invalid attributes 
    # is greater than `ratio` parameter.
    # 
    # The validation of single attribute is entrusted to the combination of 
    # lambda function named `validator` and the fact that the attribute is nan.
    #
    n_null_items = int(len(column_validators) * ratio)
    rows = []
    for i, row in df.iterrows():
        count = 0
        for head, validator in column_validators:
            if row[head].isnull() or (validator is not None and validator(row[head])):
                count += 1
        if count > n_null_items:
            rows.append(i)
    df.drop(df.index[rows], inplace=True)
    return rows

# Definition of generic lambda validator functions
check_int = lambda label: not bool(re.search(r'^(\d)+(\.0+)?$', str(label))) # checks, using regex if attribute is integer
check_positive_int = lambda label: check_int(label) or float(label) < 0 # checks if label is positive
check_date = lambda label: pd.Timestamp(label) < min_date or pd.Timestamp(label) > max_date # checks timestamps

In [9]:
from langcodes import tag_is_valid # import the library for ISO639-1 codes

# For each column define a validator.
column_validators = [
    ('id', check_int),
    ('name', None),
    ('lang', tag_is_valid),
    ('bot', lambda label: label == '1' or label == '0'),
    ('statuses_count', check_int),
    ('created_at', check_date),
]

#
# Execute the cleaning function.
#
deleted_rows = clean_invalid_rows(users, column_validators)
lg.info(f"Deleted rows: {len(deleted_rows)}")

2022-10-30 20:57:08,938 - root - INFO - Deleted rows: 0


In [10]:
#
# Replacement of invalid names
#
users['name'].replace(np.nan, '', inplace=True)

In [11]:
#
# Explore bot column
#
users['bot'].value_counts()

1    6116
0    5392
Name: bot, dtype: int64

As we can see all values of column bot are 0,1 so we can convert it into boolean field.

In [12]:
# Casting to bool
users = users.astype({'bot': 'bool'})

In [13]:
#
# Replacement of invalid languages
#

# first normalize to lower case all langs
users['lang'] = users['lang'].str.lower()

# calculate the mode for this categorical value
user_lang_mode = users['lang'].mode()[0]

# lambda function for substition
lang_subst_lambda = lambda x: x if tag_is_valid(x) else user_lang_mode

# execute substitution
users['lang'].map(lang_subst_lambda).value_counts()

en       9973
it        906
es        319
pt         65
en-gb      54
ru         42
fr         36
ja         33
zh-tw      20
tr         14
id         12
ko          9
de          8
nl          6
ar          3
da          2
en-au       1
zh-cn       1
pl          1
el          1
fil         1
sv          1
Name: lang, dtype: int64

In [60]:
#
# Define a constant that marks an attribute as an outlier.
#

def filter_datetime(df, att):
    def parse_and_check_datetime(el):
        try:
            datetime = pd.Timestamp(el) # parse datetime as Timestamp
            # checks validity
            if datetime < min_date or datetime > max_date:
                # is an outlier
                return OUTLIER_TIMESTAMP
            else:
                # is not an outlier
                return datetime
        except ValueError:
            # cannot parse as timestamp, it's an outlier
            return OUTLIER_TIMESTAMP

    df[att] = df[att].map(parse_and_check_datetime)

    return df.astype({att: 'datetime64[ns]'})

In [None]:
# Apply the filter to datetime column
users = filter_datetime(users, 'created_at')

In [15]:
#
# Handling the statuses_count column.
#
status_count_median = users['statuses_count'].median()

# replace the null values with median
users['statuses_count'].fillna(status_count_median, inplace=True)

# Convert the column `status_count` to int64 type.
users = users.astype({'statuses_count': 'int64'})

In [16]:
#
# Removing duplicate records.
#
lg.info("Starting removing duplicates")
initial_size = len(users)
users = users.drop_duplicates()
lg.info(f"Number of duplicates = {initial_size - len(users)}")

2022-10-30 20:57:09,096 - root - INFO - Starting removing duplicates
2022-10-30 20:57:09,109 - root - INFO - Number of duplicates = 0


In [17]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11508 entries, 0 to 11507
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   id              11508 non-null  int64         
 1   name            11508 non-null  object        
 2   lang            11508 non-null  object        
 3   bot             11508 non-null  bool          
 4   created_at      11508 non-null  datetime64[ns]
 5   statuses_count  11508 non-null  int64         
dtypes: bool(1), datetime64[ns](1), int64(2), object(2)
memory usage: 550.7+ KB


As we can see all the columns are now validated.

In [18]:
#
# Describe the pre-processed dataset with all columns.
#
users[['name', 'lang', 'bot', 'statuses_count', 'created_at']].describe(include='all', datetime_is_numeric=True)

Unnamed: 0,name,lang,bot,statuses_count,created_at
count,11508,11508,11508,11508.0,11508
unique,11361,24,2,,
top,Sara,en,True,,
freq,7,9970,6116,,
mean,,,,5681.686566,2017-10-03 21:23:16.013121280
min,,,,0.0,2012-01-24 01:57:38
25%,,,,42.0,2017-01-18 09:50:16.500000
50%,,,,68.0,2018-01-30 17:20:36
75%,,,,2520.25,2019-02-25 00:17:30
max,,,,399555.0,2020-04-21 07:28:31


### Tweets

In [41]:
#
# Define a function for validating text of tweet.
#
check_text = lambda x: not x or len(str(x)) <= 0

In [42]:
column_validators = [
    ('id', check_positive_int),
    ('user_id', check_positive_int),
    ('retweet_count', check_positive_int),
    ('reply_count', check_positive_int),
    ('favorite_count', check_positive_int),
    ('num_hashtags', check_positive_int),
    ('num_urls', check_positive_int),
    ('num_mentions', check_positive_int),
    ('created_at', check_date),
    ('text', check_text),
]

# clean the dataset using validators ratio function.
lg.info("Starting dataset cleaning with validators...")
deleted_rows = clean_invalid_rows(tweets, column_validators)
lg.info(f"Deleted rows {len(deleted_rows)}")

2022-10-31 10:40:50,728 - root - INFO - Starting dataset cleaning with validators...
2022-10-31 11:21:23,251 - root - INFO - Deleted rows 16490


We have decided to remove the `id` column because it's not relevant to our analysis.

In [5]:
#
# Dropping id column
#
tweets = tweets.drop('id', axis=1)

#### Analyze all columns

The followings cells perform analysis on type and convert invalid type in an OUTLIER_VALUE.

In [55]:
#
# Take `retweet_count` column as an example.
#
for col in tweets.columns:
    if col == 'text':
        #
        # skip the text column
        #
        continue
    lg.info(tweets[col].value_counts())

2022-10-31 11:32:54,771 - root - INFO - -1            431004
497404180       4600
7004532         4580
157029836       4578
1693274954      4572
               ...  
5915               1
605294402          1
2350               1
2583               1
261                1
Name: user_id, Length: 12171, dtype: int64
2022-10-31 11:32:55,640 - root - INFO - 0             9419279
1             1171971
2              357074
3              180180
4              111519
               ...   
qzf                 1
zl0v                1
lsgc4               1
tej9sl6m0           1
rf24duucpb          1
Name: retweet_count, Length: 228814, dtype: int64
2022-10-31 11:32:56,347 - root - INFO - 0             11790440
0.0            1042490
1                15670
2                 1454
1.0               1435
                ...   
gmo82                1
brs                  1
wh9v7rkff3           1
zszac                1
otsduzr              1
Name: reply_count, Length: 158619, dtype: int64
2022-10-31 11

As we can see we have a lot of invalid values, hence we need to replace them.

In [6]:
# Define a simple function that replaces invalid values with an outlier value
def replace_with_outlier(dataset, col_name, check_function, outlier_value):
    df = dataset.copy()
    v = df[col_name].map(check_function)
    record_touched = len(v) - sum(v)
    df.loc[v == False, col_name] = df[v == False][col_name].apply(lambda x: outlier_value)
    return df, record_touched

In [7]:
# check function for integer values
def check_integer_column(x):
    try:
        # we try to cast to int
        int(str(x))
        return True
    except ValueError:
        return False

In [12]:
# define all columns to be checked
INTEGER_COLUMNS = [
    'user_id',
    'retweet_count',
    'reply_count',
    'favorite_count',
    'num_hashtags',
    'num_urls',
    'num_mentions',
]

# outlier value
OUTLIER_VALUE = -1

In [13]:
#
# Replace invalid integer columns
#
for col in INTEGER_COLUMNS:
    tweets, removed = replace_with_outlier(
        tweets,
        col,
        check_integer_column,
        OUTLIER_VALUE,
    )
    lg.info(f"Detected {removed} {col} with invalid value, i.e. {removed / len(tweets) * 100}% of dataset")

2022-10-31 14:35:35,990 - root - INFO - Detected 0 user_id with invalid value, i.e. 0.0% of dataset
2022-10-31 14:35:45,264 - root - INFO - Detected 0 retweet_count with invalid value, i.e. 0.0% of dataset
2022-10-31 14:35:53,338 - root - INFO - Detected 0 reply_count with invalid value, i.e. 0.0% of dataset
2022-10-31 14:36:06,502 - root - INFO - Detected 1853922 favorite_count with invalid value, i.e. 13.567239256548408% of dataset
2022-10-31 14:36:18,503 - root - INFO - Detected 1854130 num_hashtags with invalid value, i.e. 13.568761427257511% of dataset
2022-10-31 14:36:30,111 - root - INFO - Detected 1853914 num_urls with invalid value, i.e. 13.567180711521134% of dataset
2022-10-31 14:36:40,845 - root - INFO - Detected 988054 num_mentions with invalid value, i.e. 7.2307060471744125% of dataset


For numerical columns, replace with median.

In [14]:
# Define a simple function that replaces missing values with the median (only numerical)
def clean_with_median(dataset, col_name):
    df = dataset.copy()
    v = df[col_name].map(lambda x: x != OUTLIER_VALUE)
    median = df[v == True][col_name].median()
    df.loc[v == False, col_name] = df[v == False][col_name].apply(lambda x: median)
    
    return df, sum(~v)


In [15]:
# Replacing missing data with median
for col in INTEGER_COLUMNS:
    tweets, affected = clean_with_median(tweets, col)
    lg.info(f'Detected {affected} rows with outlier value i.e. {affected / len(tweets) * 100}% of dataset')

2022-10-31 14:37:35,693 - root - INFO - Detected 434013 rows with outlier value i.e. 3.1761628652404705% of dataset
2022-10-31 14:37:46,931 - root - INFO - Detected 625542 rows with outlier value i.e. 4.577796681316584% of dataset
2022-10-31 14:37:57,852 - root - INFO - Detected 1853912 rows with outlier value i.e. 13.567166075264318% of dataset
2022-10-31 14:38:08,988 - root - INFO - Detected 1853922 rows with outlier value i.e. 13.567239256548408% of dataset
2022-10-31 14:38:19,971 - root - INFO - Detected 1854130 rows with outlier value i.e. 13.568761427257511% of dataset
2022-10-31 14:38:31,098 - root - INFO - Detected 1853914 rows with outlier value i.e. 13.567180711521134% of dataset
2022-10-31 14:38:42,264 - root - INFO - Detected 988054 rows with outlier value i.e. 7.2307060471744125% of dataset


Managing the text column, we want to make the column a string type.

In [68]:
#
# Compute statistics
#
invalid_texts = tweets['text'].map(pd.isnull)
lg.info(f"Found {sum(invalid_texts)} records, i.e. {sum(invalid_texts) / len(invalid_texts) * 100}% of dataset")
# to optize the memory
del invalid_texts

2022-10-31 15:10:55,925 - root - INFO - Found 0 records, i.e. 0.0% of dataset


In [63]:
# Handle the text record
def handle_text_record(x):
    if pd.isnull(x):
        return ''
    else:
        return str(x)

In [69]:
# Execute the function
tweets['text'] = tweets['text'].map(handle_text_record)

Unnamed: 0,user_id,retweet_count,reply_count,favorite_count,num_hashtags,num_urls,num_mentions,created_at,text
0,327746321,0,0,0,0,0,0,2019-09-11 14:53:55,"If man is a little lower than angels, then ang..."
1,333722906,1,0,0,0,0,1,2020-04-01 20:27:04,"""@BestWSHHVids: how do you say these words wit..."
2,2379755827,0,0,0,0,0,1,2019-05-02 13:34:31,@LOLatComedy awsome
3,466226882,0,0,0,0,0,0,2019-11-04 07:17:37,Stephen Hawkins: i buchi neri non esistono se ...
4,1355537995,114,0,0,1,0,1,2020-03-11 16:45:31,RT @tibbs_montris: So ready for Wednesday!
...,...,...,...,...,...,...,...,...,...
13664690,220933018,0,0,0,0,0,0,2018-05-04 05:29:33,ESTA MANANA AUN ESTA MUY FRIO ! MIREN ESTO ! ...
13664691,587491046,0,0,0,0,0,1,2020-04-17 02:51:53,"@warriors Congrats, maybe I'll be able to get ..."
13664693,91781300,0,0.0,0.0,0.0,0.0,0.0,2016-07-10 22:43:09,
13664694,127895572,0,0,1,1,0,0,2019-03-07 19:56:55,Shooting crew of porn movies. #TheWorstJobToHave


In [62]:
#
# Managing the Datetime column
#
tweets = filter_datetime(tweets, 'created_at')

In [70]:
#
# Casting all dataset
#
tweets = tweets.astype({
    'user_id': 'int64',
    'retweet_count': 'int64',
    'reply_count': 'int64',
    'favorite_count': 'int64',
    'num_hashtags': 'int64',
    'num_urls': 'int64',
    'num_mentions': 'int64',
    'created_at': 'datetime64[ns]',
    'text': 'string',
})

In [74]:
#
# Printing statistics about cleaning
#

initial_ds_len = len(tweets)
tweets = tweets.drop_duplicates()
lg.info(f'Removed {initial_ds_len - len(tweets)} duplicates record that are {(initial_ds_len - len(tweets)) / initial_ds_len * 100}% of dataset.')
del initial_ds_len

2022-10-31 15:21:02,911 - root - INFO - Removed 0 duplicates record that are 0.0% of dataset.


In [75]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11249771 entries, 0 to 13664695
Data columns (total 9 columns):
 #   Column          Dtype         
---  ------          -----         
 0   user_id         int64         
 1   retweet_count   int64         
 2   reply_count     int64         
 3   favorite_count  int64         
 4   num_hashtags    int64         
 5   num_urls        int64         
 6   num_mentions    int64         
 7   created_at      datetime64[ns]
 8   text            string        
dtypes: datetime64[ns](1), int64(7), string(1)
memory usage: 858.3 MB


In [30]:
tweets.describe()

  tweets.describe()


Unnamed: 0,id,user_id,retweet_count,reply_count,favorite_count,num_hashtags,num_urls,num_mentions,created_at,text
count,11265103,11265103.0,11265103.0,11265103.0,11265103.0,11265103.0,11265103.0,11265103.0,11265103,11265103.0
unique,11242504,12170.0,45814.0,645.0,1570.0,312.0,450.0,394.0,8029865,6778479.0
top,dpy,491630583.0,0.0,0.0,0.0,0.0,0.0,0.0,1800-01-01 00:00:00,
freq,10,3929.0,7771921.0,9841583.0,7903049.0,8739113.0,8259047.0,6253434.0,97605,406989.0
first,,,,,,,,,1800-01-01 00:00:00,
last,,,,,,,,,2020-05-03 10:36:12,


In [34]:
# 
# Save the CSV
#
tweets.to_csv('tweets_cleaned.csv')
users.to_csv('users_cleaned.csv')

### Correlation

In [None]:
tweets.corr()

## Distribution

In [None]:
def build_grid_plot(configs):
    cols = 2 if len(configs) <= 4 else 3
    rows = math.ceil(len(configs) / cols)
    fig_dims = (rows, cols)
    fig = plt.figure(figsize=(20, rows * 5))
    fig.subplots_adjust(hspace=0.2, wspace=0.2)

    for i, config in enumerate(configs):
        if i == len(configs) - 1 and len(configs) % cols == 1 and cols % 2 == 1:
            plt.subplot2grid(fig_dims, (i // cols, cols // 2))
        else:
            plt.subplot2grid(fig_dims, (i // cols, i % cols))
        if config['type'] == 'hist':
            config['column'].hist(bins=int(math.log2(len(config['column'])) + 1))
            plt.title(config['title'])
        elif config['type'] == 'bar':
            config['column'].value_counts().plot(kind='bar', title=config['title'])
            if ('rotation' in config) and config['rotation']:
                plt.xticks(rotation=0)
        elif config['type'] == 'boxplot':
            config['df'].boxplot(column=config['columns'])
    plt.show()

In [None]:
configs = [
    {
        'type': 'hist',
        'column': tweets['retweet_count'],
        'title': 'Retweet Counts'
    },
    {
        'type': 'hist',
        'column': tweets['reply_count'],
        'title': 'Replay Counts',
    },
    {
        'type': 'hist',
        'column': tweets['favorite_count'],
        'title': 'Favorite Counts'
    },
    {
        'type': 'hist',
        'column': tweets['num_hashtags'],
        'title': 'Hashtag Counts'
    },
    {
        'type': 'hist',
        'column': tweets['num_urls'],
        'title': 'Url Counts'
    },
    {
        'type': 'hist',
        'column': tweets['num_mentions'],
        'title': 'Mentions Counts'
    },
    {
        'type': 'hist',
        'column': tweets['created_at'],
        'title': 'Tweets Creation Date Distribution'
    }
]

build_grid_plot(configs=configs)

In [None]:
configs = [
    {
        'type': 'hist',
        'column': users['statuses_count'],
        'title': 'Statues Counts'
    },
    {
        'type': 'bar',
        'column': users['bot'].map(lambda v: 'Bot' if v else 'User'),
        'title': 'Bot and User Counts',
        'rotation': True
    },
    {
        'type': 'bar',
        'column': users['lang'],
        'title': 'Languages Counts'
    },
    {
        'type': 'hist',
        'column': users['created_at'],
        'title': 'User Creation Date Distribution'
    }
]

build_grid_plot(configs=configs)

### Outlier detection

In [None]:
def replace_outliers(df, column_name, threshold):
    column = df[column_name]
    to_replace = len(column[column > threshold])
    perc_to_replace = (to_replace / len(column) * 100)
    lg.info(f'{to_replace} ({perc_to_replace:.2f}%) element replaced for column {column_name}')
    median = column.median()
    df[column_name] = column.map(lambda x: median if x > threshold else x)

In [None]:
def boxplot_tweets_show():
    configs = [
        {
            'type': 'boxplot',
            'df': tweets,
            'columns': ['retweet_count']
        },
        {
            'type': 'boxplot',
            'df': tweets,
            'columns': ['reply_count']
        },
        {
            'type': 'boxplot',
            'df': tweets,
            'columns': ['favorite_count']
        },
        {
            'type': 'boxplot',
            'df': tweets,
            'columns': ['num_hashtags']
        },
        {
            'type': 'boxplot',
            'df': tweets,
            'columns': ['num_urls']
        },
        {
            'type': 'boxplot',
            'df': tweets,
            'columns': ['num_mentions']
        },
    ]

    build_grid_plot(configs=configs)

In [None]:
boxplot_tweets_show()

In [None]:
replace_outliers(tweets, 'retweet_count', 6e5)
replace_outliers(tweets, 'reply_count', 6e4)
replace_outliers(tweets, 'favorite_count', 1.2e5)
replace_outliers(tweets, 'num_hashtags', 1e4)
replace_outliers(tweets, 'num_urls', 1e4)
replace_outliers(tweets, 'num_mentions', 1e5)

boxplot_tweets_show()

In [None]:
plt.figure(figsize=(20, 10))
tweets.plot.scatter(x='reply_count', y='favorite_count')
plt.show()

# TASK 1.2

## Data preparation

- How many tweets were published by the user?
- How many tweets are published by the user in a given period of time?
- Total number of tweets
- Total number of likes and comments
- Ratio between the number of tweets and the number of likes
- Entropy of the user
- Average length of the tweets per user
- Average number of special characters in the tweets per user

In [None]:
column_name = 'tweets_num'
tweets_grouped_by_users = tweets.groupby(['user_id']).size()
users[column_name] = tweets_grouped_by_users

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'int64'})

users

In [None]:
tweets_greater_2020 = tweets[tweets['created_at'] >= pd.Timestamp('2020-01-01')]
tweets_filtered_2020 = tweets_greater_2020[tweets['created_at'] < pd.Timestamp('2021-01-01')]
tweets_grouped_2020 = tweets_filtered_2020.groupby(['user_id']).size()

column_name = 'tweets_2020_num'
users[column_name] = tweets_grouped_2020

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'int64'})

users

In [None]:
column_name = 'tweets_total_num'
users[column_name] = users['statuses_count'] + users['tweets_num']

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'int64'})

users

In [None]:
column_name = 'likes_num'
tweets_grouped_likes = tweets.rename(columns={'favorite_count': column_name}).groupby(['user_id'])[column_name].sum()

users = users.join(tweets_grouped_likes, on='id')

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'int64'})

users

In [None]:
column_name = 'comments_num'
tweets_grouped_comments = tweets.rename(columns={'reply_count': column_name}).groupby(['user_id'])[column_name].sum()

users = users.join(tweets_grouped_comments, on='id')

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'int64'})

users

In [None]:
column_name = 'ratio_tweets_likes'
users[column_name] = users['tweets_total_num'] / users['likes_num']

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'float64'})

users

In [None]:
column_name = 'entropy'
avg_tweets_total_num = users['tweets_total_num'] / users['tweets_total_num'].sum()
users[column_name] = - (avg_tweets_total_num * np.log(avg_tweets_total_num))

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'float64'})

users

In [None]:
column_name = 'texts_total_length'
tmp_tweets = tweets.rename(columns={'text': column_name})
tmp_tweets[column_name] = tmp_tweets[column_name].map(lambda t: len(t))
tweets_grouped_comments = tmp_tweets.groupby(['user_id'])[column_name].sum()

users = users.join(tweets_grouped_comments, on='id')

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'int64'})

users

In [None]:
column_name = 'texts_special_chars_length'
tmp_tweets = tweets.rename(columns={'text': column_name})

def count_special_chars(text):
    count = 0
    for ch in text:
        if not ch.isalpha() and not ch.isdigit():
            count += 1
    return count

tmp_tweets[column_name] = tmp_tweets[column_name].map(count_special_chars)
tweets_grouped_comments = tmp_tweets.groupby(['user_id'])[column_name].sum()

users = users.join(tweets_grouped_comments, on='id')

users[column_name].replace(np.nan, 0, inplace=True)
users = users.astype({column_name: 'int64'})

users

the team has to explore the new features for a statistical analysis (distributions, outliers, visualizations, correlations)

In [None]:
configs = [
    {
        'type': 'hist',
        'column': users['tweets_num'],
        'title': 'Tweets num'
    },
    {
        'type': 'hist',
        'column': users['tweets_2020_num'],
        'title': 'Tweets 2020 num'
    },
    {
        'type': 'hist',
        'column': users['tweets_total_num'],
        'title': 'Tweets total num'
    },
    {
        'type': 'hist',
        'column': users['likes_num'],
        'title': 'Likes num'
    },
    {
        'type': 'hist',
        'column': users['comments_num'],
        'title': 'Comments num'
    },
    #{ # TODO GERE: inf values, understand how plot it
    #    'type': 'hist',
    #    'column': users['ratio_tweets_likes'],
    #    'title': 'Ratio tweets likes'
    #},
    {
        'type': 'hist',
        'column': users['entropy'],
        'title': 'Entropy'
    },
    {
        'type': 'hist',
        'column': users['texts_total_length'],
        'title': 'Texts total length'
    },
    {
        'type': 'hist',
        'column': users['texts_special_chars_length'],
        'title': 'Texts special chars length'
    },
]

build_grid_plot(configs=configs)

In [None]:
def boxplot_tweets_newfeatures_show():
    configs = [
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['tweets_num']
        },
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['tweets_2020_num']
        },
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['tweets_total_num']
        },
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['likes_num']
        },
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['comments_num']
        },
        #{
        #    'type': 'boxplot',
        #    'df': users,
        #    'columns': ['ratio_tweets_likes']
        #},
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['entropy']
        },
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['texts_total_length']
        },
        {
            'type': 'boxplot',
            'df': users,
            'columns': ['texts_special_chars_length']
        },
    ]

    build_grid_plot(configs=configs)

boxplot_tweets_newfeatures_show()

In [None]:
users.corr()