# Data exploration and pre-processing

## Data reading 

In [1]:
import re
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
tweets = pd.read_csv("../processed_tweet_data.csv")
tweets[:2]

Unnamed: 0,created_at,source,original_text,polarity,subjectivity,lang,favorite_count,retweet_count,original_author,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place
0,Fri Jun 18 17:55:49 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...","🚨Africa is ""in the midst of a full-blown third...",0.166667,0.188889,en,548,612,ketuesriche,551,351,False,[],"[{'screen_name': 'TelGlobalHealth', 'name': 'T...",Mass
1,Fri Jun 18 17:55:59 +0000 2021,"<a href=""https://mobile.twitter.com"" rel=""nofo...","Dr Moeti is head of WHO in Africa, and one of ...",0.133333,0.455556,en,195,92,Grid1949,66,92,False,[],"[{'screen_name': 'globalhlthtwit', 'name': 'An...","Edinburgh, Scotland"


In [3]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6532 entries, 0 to 6531
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   created_at          6532 non-null   object 
 1   source              6532 non-null   object 
 2   original_text       6532 non-null   object 
 3   polarity            6532 non-null   float64
 4   subjectivity        6532 non-null   float64
 5   lang                6532 non-null   object 
 6   favorite_count      6532 non-null   int64  
 7   retweet_count       6532 non-null   int64  
 8   original_author     6532 non-null   object 
 9   followers_count     6532 non-null   int64  
 10  friends_count       6532 non-null   int64  
 11  possibly_sensitive  3618 non-null   object 
 12  hashtags            6532 non-null   object 
 13  user_mentions       6532 non-null   object 
 14  place               4088 non-null   object 
dtypes: float64(2), int64(4), object(9)
memory usage: 765.6+ KB

In [4]:
print("The number of missing value(s): {}".format(tweets.isnull().sum().sum()))
print("Columns having missing values: {}".format(tweets.columns[tweets.isnull().any()]))

The number of missing value(s): 5358
Columns having missing values: Index(['possibly_sensitive', 'place'], dtype='object')
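The total above can be broken down per column. A minimal sketch on a small made-up frame (the `demo` DataFrame and its values are assumptions for illustration, not the tweet data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the tweets DataFrame.
demo = pd.DataFrame({
    "original_text": ["a", "b", "c", "d"],
    "possibly_sensitive": [False, np.nan, np.nan, True],
    "place": ["Mass", np.nan, "Edinburgh, Scotland", np.nan],
})

# Count missing values per column and keep only the columns that have any.
missing_per_column = demo.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

# Express each count as a share of all rows.
missing_share = (missing_per_column / len(demo)).round(2)
print(pd.DataFrame({"missing": missing_per_column, "share": missing_share}))
```

For the real data, the non-null counts reported by tweets.info() give the same breakdown: possibly_sensitive is missing in 6532 - 3618 = 2914 rows and place in 6532 - 4088 = 2444 rows, which adds up to the 5358 missing values reported above.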


## Pre-processing

In [5]:
import os
import sys

sys.path.append(os.path.abspath(os.path.join('..')))

from clean_tweets_dataframe import Clean_Tweets

In [6]:
ct = Clean_Tweets(tweets)

Automation in Action...!!!


### Drop unwanted columns

In [7]:
print(f"Shape of tweets before dropping unwanted tweets {tweets.shape}")
tweets = ct.drop_unwanted_column(tweets)
print(f"Shape of tweets after dropping unwanted tweets {tweets.shape}")

Shape of tweets before dropping unwanted tweets (6532, 15)
Shape of tweets after dropping unwanted tweets (6532, 15)


This indicates that all values in the retweet_count column are valid, because no rows were dropped.
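The internals of `Clean_Tweets` are not shown in this notebook, so the following is only a guess at the kind of filter that would produce this behaviour: dropping stray header rows, where `retweet_count` contains its own column name. The names `drop_header_rows` and `demo` are hypothetical.

```python
import pandas as pd

def drop_header_rows(df: pd.DataFrame, column: str = "retweet_count") -> pd.DataFrame:
    """Drop rows where `column` holds its own name, i.e. a repeated header row."""
    is_header_row = df[column].astype(str) == column
    return df[~is_header_row].reset_index(drop=True)

# Hypothetical toy data: the middle row is a header line that slipped into the data.
demo = pd.DataFrame({
    "retweet_count": ["612", "retweet_count", "92"],
    "original_text": ["first", "original_text", "second"],
})
cleaned = drop_header_rows(demo)
print(demo.shape, "->", cleaned.shape)
```

With a filter of this kind, an unchanged shape means no such rows were present.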

### Drop duplicate

In [8]:
print(f"Shape of tweets before dropping duplicate tweets {tweets.shape}")
tweets = ct.drop_duplicate(tweets)
print(f"Shape of tweets after dropping duplicate tweets {tweets.shape}")

Shape of tweets before dropping duplicate tweets (6532, 15)
Shape of tweets after dropping duplicate tweets (6532, 15)


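No rows were removed, so the data contains no fully duplicated rows. Note that the Unnamed: 0 column appears to hold a running row index, so a whole-row comparison would not flag tweets that merely repeat the same text.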
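A quick way to check for duplicates directly with pandas, sketched on a made-up frame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
demo = pd.DataFrame({
    "original_author": ["a", "b", "a"],
    "original_text": ["hello", "world", "hello"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(demo.duplicated().sum())

# drop_duplicates() returns a new frame; the original keeps its shape
# unless the result is assigned back.
deduplicated = demo.drop_duplicates()
print(n_duplicates, demo.shape, deduplicated.shape)
```

Passing `subset=["original_text"]` to either method would compare only the tweet text instead of whole rows.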

### Convert Created time to date time

In [9]:
print(f"The first row created time before conversion {tweets.created_at[0]}")
tweets = ct.convert_to_datetime(tweets)
print(f"The first row created time after conversion {tweets.created_at[0]}")

The first row created time before conversion Fri Jun 18 17:55:49 +0000 2021
The first row created time after conversion 2021-06-18 17:55:49+00:00
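The conversion itself happens inside `Clean_Tweets`, which is not shown here. As a rough sketch of how such a conversion can be done with `pd.to_datetime`, assuming the Twitter-style timestamp format seen above (the `raw` series is made up):

```python
import pandas as pd

# Hypothetical input: one valid Twitter-style timestamp and one bad value.
raw = pd.Series(["Fri Jun 18 17:55:49 +0000 2021", "not a date"])

# An explicit format avoids ambiguous parsing; errors="coerce" turns
# unparseable entries into NaT instead of raising; utc=True yields a
# timezone-aware UTC column.
parsed = pd.to_datetime(
    raw, format="%a %b %d %H:%M:%S %z %Y", errors="coerce", utc=True
)
print(parsed[0])
print(parsed.isna().sum())
```

Counting the NaT values afterwards is a simple check on how many rows failed to parse.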


### Convert numeric values to number

In [10]:
tweets = ct.convert_to_numbers(tweets)
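The body of `convert_to_numbers` is not shown, so the following is a generic sketch of numeric conversion with `pd.to_numeric` rather than the class's actual code. The `demo` frame and the `numeric_columns` list are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with numbers stored as strings and one bad value.
demo = pd.DataFrame({
    "polarity": ["0.166667", "0.133333", "n/a"],
    "favorite_count": ["548", "195", "12"],
})

numeric_columns = ["polarity", "favorite_count"]
for column in numeric_columns:
    # Each column is converted from itself; errors="coerce" maps
    # unparseable values to NaN instead of raising.
    demo[column] = pd.to_numeric(demo[column], errors="coerce")

print(demo.dtypes)
```

Converting each column from itself, and checking the resulting dtypes and NaN counts, makes it easy to confirm that no column was overwritten by another.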

### Remove non-English tweets


In [11]:
print(f"Shape of tweets before removing non-English tweets {tweets.shape}")
tweets = ct.remove_non_english_tweets(tweets)
print(f"Shape of tweets after removing non-English tweets {tweets.shape}")

Shape of tweets before removing non-English tweets (6532, 17)
Shape of tweets after removing non-English tweets (6532, 17)


### Removing Punctuations, Numbers, and Special Characters

In [12]:
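# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
tweets['original_text'] = tweets['original_text'].str.replace("[^a-zA-Z#]", " ", regex=True)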
tweets[:2]

Unnamed: 0,created_at,source,original_text,polarity,subjectivity,lang,favorite_count,retweet_count,original_author,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place,friends_count.1,follower_count
0,2021-06-18 17:55:49+00:00,"<a href=""http://twitter.com/download/iphone"" r...",Africa is in the midst of a full blown third...,0.166667,0.188889,en,548,0.166667,ketuesriche,551,351,False,[],"[{'screen_name': 'TelGlobalHealth', 'name': 'T...",Mass,0.166667,0.166667
1,2021-06-18 17:55:59+00:00,"<a href=""https://mobile.twitter.com"" rel=""nofo...",Dr Moeti is head of WHO in Africa and one of ...,0.133333,0.455556,en,195,0.133333,Grid1949,66,92,False,[],"[{'screen_name': 'globalhlthtwit', 'name': 'An...","Edinburgh, Scotland",0.133333,0.133333

Note that in this preview retweet_count, friends_count.1 and follower_count all repeat the polarity value (0.166667 and 0.133333), whereas retweet_count was 612 and 92 in the raw data. This suggests the numeric conversion step overwrote these columns with polarity, so they should be re-checked before being used in any analysis.


### Save cleaned data

In [13]:
tweets.to_csv(r'clean_tweets.csv')

## Data exploration 

### Rows and columns in the dataset

In [14]:
tweets = pd.read_csv("clean_tweets.csv")

In [15]:
print('Count of columns in the data is:  ', len(tweets.columns))
print('Count of rows in the data is:  ', len(tweets))

Count of columns in the data is:   18
Count of rows in the data is:   6532


### Users that made the tweets

In [16]:
tweets.groupby("original_author").size().agg( ['count', 'min', 'max', 'mean', 'median'])

count     5248.000000
min          1.000000
max        530.000000
mean         1.244665
median       1.000000
dtype: float64

From this we can observe that the tweets were made by 5248 distinct users. Most users tweeted only once (the median is 1), 
while the most active user made 530 tweets.

In [17]:
twitter_users = tweets.groupby("original_author").size()
twitter_users.nlargest(5)

original_author
PuneUpdater        530
viralvideovlogs     45
Signal__Pump        27
WHO__India          27
Rosenchild          11
dtype: int64

PuneUpdater, with 530 tweets, is a clear outlier: the next most active account, viralvideovlogs, made 45.
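To put an outlier like this in proportion, it helps to look at each author's share of all tweets. A small sketch with made-up author names (the `authors` series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data: author "A" dominates the sample.
authors = pd.Series(["A"] * 6 + ["B"] * 2 + ["C", "D"], name="original_author")

# value_counts() sorts authors by number of tweets, most active first.
tweets_per_author = authors.value_counts()

# Share of all tweets written by the most active author.
top_share = tweets_per_author.iloc[0] / tweets_per_author.sum()

# Share of authors who tweeted exactly once.
single_tweet_share = (tweets_per_author == 1).mean()

print(tweets_per_author.head(3))
print(f"top author: {top_share:.0%}, single-tweet authors: {single_tweet_share:.0%}")
```

In this dataset the same calculation gives 530 / 6532, so the single most active account wrote about 8% of all tweets.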

### Tweet sentiments

In [18]:
def text_category(p):
  if p > 0:
    return "positive"
  elif p < 0:
    return "negative"
  else:
    return "neutral" 

In [19]:
tweets["score"] = tweets["polarity"].apply(text_category)
tweets.groupby("score")["polarity"].count()


score
negative    1277
neutral     1829
positive    3426
Name: polarity, dtype: int64
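The same labelling can also be done without a Python-level function. The sketch below is an alternative, not the notebook's method: it uses `np.sign` and a lookup dictionary on a made-up `polarity` series.

```python
import numpy as np
import pandas as pd

# Hypothetical polarity values.
polarity = pd.Series([0.166667, -0.25, 0.0, 0.133333])

# np.sign returns -1.0, 0.0 or 1.0; map those onto the three labels.
labels = {-1.0: "negative", 0.0: "neutral", 1.0: "positive"}
score = np.sign(polarity).map(labels)

# normalize=True reports shares instead of raw counts.
print(score.value_counts(normalize=True))
```

One difference worth knowing: a missing polarity stays missing here, whereas `text_category` returns "neutral" for NaN because both comparisons evaluate to False.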

In [20]:
# Select the numeric columns before summing so text columns are not aggregated
tweet_sent = tweets.groupby("score")[
    ["favorite_count", "followers_count", "friends_count"]].sum()
tweet_sent

score,favorite_count,followers_count,friends_count
negative,159704,10157572,1278012
neutral,396722,36775025,2520950
positive,1030171,60839857,7355421


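We can observe that positive tweets about COVID-19 collected the most likes in total, and that their authors have the largest combined follower and friend counts. These are sums, so they partly reflect that positive tweets are the most numerous. Dividing the like totals by the tweet counts above gives roughly 301 likes per positive tweet, 217 per neutral tweet and 125 per negative tweet, so positive tweets are also more liked on average.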
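Because group totals depend on group size, it can be useful to compute a per-tweet mean next to each sum. A minimal sketch on made-up numbers (the `demo` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data: three positive tweets and one negative tweet.
demo = pd.DataFrame({
    "score": ["positive", "positive", "positive", "negative"],
    "favorite_count": [10, 20, 30, 40],
    "followers_count": [100, 200, 300, 400],
})

# Aggregate each numeric column with both sum and mean per sentiment group.
summary = demo.groupby("score")[["favorite_count", "followers_count"]].agg(
    ["sum", "mean"]
)
print(summary)
```

In this toy example the positive group has the larger total (60 likes against 40) but the smaller mean (20 against 40), which is exactly the effect that totals alone can hide.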

### Hashtags

## Visualisations

## Save the Data

missing values

### Clean original_text


In [None]:
def clean_tweet(tweet):
    """Replace every non-letter character in a tweet with a space."""
    return re.sub(r"[^a-zA-Z]", " ", tweet)


tweets["original_text"] = tweets.original_text.apply(clean_tweet)
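The two cleaning patterns used in this notebook differ only in whether `#` is kept. The sketch below contrasts them on a made-up tweet; the function name `clean_text`, the `keep_hashtags` flag and the extra whitespace collapsing are assumptions added for illustration.

```python
import re

def clean_text(text: str, keep_hashtags: bool = False) -> str:
    """Replace every character outside A-Z/a-z (optionally '#') with a space,
    then collapse runs of whitespace into a single space."""
    pattern = r"[^a-zA-Z#]" if keep_hashtags else r"[^a-zA-Z]"
    replaced = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", replaced).strip()

# Hypothetical sample tweet.
sample = "RT @WHOAFRO: Africa needs 20 million doses! #COVID19"
print(clean_text(sample))
print(clean_text(sample, keep_hashtags=True))
```

Both variants remove digits, so a hashtag such as #COVID19 is reduced to COVID or #COVID. If hashtags matter for later analysis, they are better taken from the hashtags column than from the cleaned text.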

### Convert tweet sentiment to category

In [29]:
def text_category(p):
  if p > 0:
    return "positive"
  elif p < 0:
    return "negative"
  else:
    return "neutral"

In [21]:
tweets["polarity"] = tweets["polarity"].apply(text_category)



## Data exploration 

### Tweet languages

In [25]:
tweet_lang = tweets.groupby(['lang']).size()
tweet_lang

lang
en    6532
dtype: int64

### Tweet sentiments

In [27]:
tweet_sent = tweets.groupby(['polarity']).size()
tweet_sent

Unnamed: 0,created_at,source,original_text,polarity,subjectivity,lang,favorite_count,retweet_count,original_author,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place
0,Fri Jun 18 17:55:49 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...","RT @TelGlobalHealth: 🚨Africa is ""in the midst ...",neutral,0.000000,en,,,ketuesriche,551,351,,[],"[{'screen_name': 'TelGlobalHealth', 'name': 'T...",Mass
1,Fri Jun 18 17:55:59 +0000 2021,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @globalhlthtwit: Dr Moeti is head of WHO in...,positive,0.455556,en,,,Grid1949,66,92,,[],"[{'screen_name': 'globalhlthtwit', 'name': 'An...","Edinburgh, Scotland"
2,Fri Jun 18 17:56:07 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...",RT @NHSRDForum: Thank you @research2note for c...,positive,0.483333,en,,,LeeTomlinson8,1195,1176,,"[{'text': 'red4research', 'indices': [103, 116]}]","[{'screen_name': 'NHSRDForum', 'name': 'NHS R&...",
3,Fri Jun 18 17:56:10 +0000 2021,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @HighWireTalk: Former Pfizer VP and Virolog...,positive,0.166667,en,,,RIPNY08,2666,2704,,[],"[{'screen_name': 'HighWireTalk', 'name': 'The ...",
4,Fri Jun 18 17:56:20 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",RT @PeterHotez: I think it’s important that we...,positive,0.766667,en,,,pash22,28250,30819,,[],"[{'screen_name': 'PeterHotez', 'name': 'Prof P...",United Kingdom
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6527,Sat Jun 19 07:41:15 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",RT @Givenkazeni: Zweli please just release the...,neutral,0.400000,en,,,Mthatos_Vivi,447,1089,,[],"[{'screen_name': 'Givenkazeni', 'name': 'le’Gi...",
6528,Sat Jun 19 07:41:26 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",RT @HighWireTalk: Former Pfizer VP and Virolog...,positive,0.166667,en,,,wayno_af007,2224,2739,,[],"[{'screen_name': 'HighWireTalk', 'name': 'The ...","The boro, MA"
6529,Sat Jun 19 07:41:31 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...","@Jenfeds73 @DcrInYYC Respectfully, veterinaria...",positive,0.506250,en,,,dublonothing,3000,4709,,[],"[{'screen_name': 'Jenfeds73', 'name': 'Bubs 🇨🇦...","Los Angeles, CA"
6530,Sat Jun 19 07:41:45 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...","RT @WHOAFRO: ""Africa needs millions more doses...",positive,0.166667,en,,,DrAmirKhanGP,135163,1284,,"[{'text': 'COVID19', 'indices': [120, 128]}]","[{'screen_name': 'WHOAFRO', 'name': 'WHO Afric...",Yorkshire and The Humber


polarity
negative    1216
neutral     2508
positive    2808
dtype: int64

## Visualisations

## Save the Data
