# Project Title: Visualization Project for Power BI

## Introduction
Twitter is a popular microblogging service where users can read and post short messages, called tweets, of up to 280 characters. This project classifies tweets as positive, negative, or neutral and derives insights from them.

### Objective:
Examine the overall sentiment of Twitter users (Nigerians only) at any given point in time.

Tasks:
- Extract random Twitter data for the last week, up to a maximum of 200,000 rows. The data should not be tied to any specific hashtags.
- Clean the data: remove columns in which at least 70% of rows are null; for columns in which fewer than 30% of rows are null, replace nulls with ‘Not Available’, or with the column mean for integer columns.
- Restrict the data to English tweets only.
- Extract the polarity and subjectivity of each tweet.
- Remove stop words from the tweets.
- In Power BI:
  - Create a word cloud of the tweets with stop words removed. What are the 20 most-used words, and what percentage of them are positive?
  - Create a measure for the average polarity and subjectivity of Twitter users in the given period. How subjective are these users, and how positive or negative are their tweets?
  - Create an hourly chart to see whether there is a time of day when Twitter users are less aggressive or more aggressive.
  - What percentage of Twitter users are negative at least 50% of the time?



In [1]:
#!pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint

## Import necessary libraries

In [1]:
import twint
import re
import pandas as pd
from datetime import datetime
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
import nest_asyncio
import numpy as np

#### Examine the overall sentiment of Twitter users (Nigerians only) at any given point in time.

In [4]:
# Nigerian cities to scrape, loaded from this CSV file
cities = pd.read_csv("ng.csv")

In [5]:
new_cities = list(cities.city)

In [6]:
# Sort the cities alphabetically
cities_sorted = sorted(new_cities)

In [7]:
print(cities_sorted)

['Aba', 'Abagana', 'Abaji', 'Abak', 'Abakaliki', 'Abat', 'Abejukolo', 'Abeokuta', 'Abigi', 'Aboh', 'Aboh', 'Abonnema', 'Abua', 'Abudu', 'Abuja', 'Abuochiche', 'Achalla', 'Adikpo', 'Ado-Ekiti', 'Adogo', 'Afaha Ikot Ebak', 'Afaha Offiong', 'Afam', 'Afikpo', 'Afon', 'Afor-Oru', 'Afuze', 'Agaie', 'Agbani', 'Agbor', 'Agege', 'Agenebode', 'Ago-Amodu', 'Aguata', 'Aguobu-Owa', 'Agwara', 'Ahoada', 'Ajaawa', 'Ajaka', 'Ajalli', 'Ajegunle', 'Ajingi', 'Akamkpa', 'Akanran', 'Akinima', 'Akodo', 'Akpafa', 'Akpet Central', 'Akure', 'Akwanga', 'Akwete', 'Akwukwu-Igbo', 'Albasu', 'Aliade', 'Aliero', 'Alkaleri', 'Amagunze', 'Amaigbo', 'Anaku', 'Anchau', 'Angware', 'Anka', 'Ankpa', 'Apapa', 'Apomu', 'Aramoko-Ekiti', 'Araromi-Opin', 'Argungu', 'Arochukwu', 'Asaba', 'Askira', 'Atan', 'Atani', 'Auchi', 'Augie', 'Auyo', 'Awe', 'Awgu', 'Awka', 'Awo', 'Awo-Idemili', 'Ayete', 'Ayetoro', 'Azare', 'Azare', 'Baap', 'Babban Gida', 'Babura', 'Badagry', 'Bagudo', 'Bagwai', 'Baissa', 'Bajoga', 'Bakori', 'Bakura', 'Bali'

In [8]:
# Tweets will be scraped from 745 towns and cities across Nigeria
print(len(cities_sorted))

745
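The printed list contains repeated names such as 'Aboh' and 'Azare', so a search near the same name may run twice. A minimal sketch of de-duplicating the names before scraping is shown here; `unique_sorted` is a hypothetical helper, and it assumes repeated names refer to the same search target.

```python
def unique_sorted(names):
    """Return the distinct names in alphabetical order."""
    return sorted(set(names))


# Small sample taken from the printed city list
sample = ["Aboh", "Aba", "Aboh", "Azare", "Azare"]
print(unique_sorted(sample))
```

Whether the repeated rows are true duplicates or distinct towns with the same name would need to be checked against the other columns of `ng.csv`.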


In [9]:
def scrapping_cities():
    """Loop over cities_sorted, run a twint search near each city,
    and return all collected tweets in a single pandas DataFrame."""
    frames = []
    for city in cities_sorted:
        c = twint.Config()
        # c.Search = "#Naira"  # no hashtag filter: the task asks for random tweets
        c.Near = city
        c.Lang = "en"
        c.Pandas = True
        c.Pandas_au = True
        c.Pandas_clean = True
        c.Since = "2022-01-01"
        c.Until = "2022-12-14"
        c.Hide_output = True
        c.Limit = 200000
        c.Count = True
        twint.run.Search(c)
        # twint keeps the results of the latest search in this module-level DataFrame
        frames.append(twint.storage.panda.Tweets_df)
    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(frames) if frames else pd.DataFrame()
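The task asks for the last week of data, while the `Since` and `Until` values in the function are hard-coded. As a sketch only, the window could be derived from the current date; `last_week_window` is a hypothetical helper, and the `YYYY-MM-DD` string format is assumed from the hard-coded values above.

```python
from datetime import date, timedelta


def last_week_window(today=None):
    """Return (since, until) as YYYY-MM-DD strings covering the last 7 days."""
    today = today or date.today()
    since = today - timedelta(days=7)
    return since.isoformat(), today.isoformat()


since, until = last_week_window(date(2022, 12, 14))
print(since, until)
```

The two strings could then be assigned to `c.Since` and `c.Until` in place of the fixed dates.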

In [10]:
# Run the scraper; nest_asyncio.apply() lets twint's asyncio loop run inside the notebook's already-running loop
nest_asyncio.apply()
tweets = scrapping_cities()

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
[+] Finished: Successfully collected 0 Tweets.
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
[+] Finished: Successfully collected 20 Tweets.
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
[+] Finished: Successfully collected 80 Tweets.
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
[+] Finished: Successfully collected 0 Tweets.
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
[+] Finished: Successfully collected 0 Tweets.
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
[+] Finished: Successfully collected 19 Tweets.
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
[+] Finished: Successfully collected 19 Tweets.

Over 13,000 tweets were scraped across the whole country for the month of December 2022.

In [45]:
# Checking the last five rows of the DataFrame
tweets.tail()

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,tweet_without_stopwords
12,1601693353326039040,1601693353326039040,1670708000000.0,2022-12-10 22:40:28,100,,favorites,en,[],[],...,,,,,[],,,,,favorites
14,1601635858616291328,1601635858616291328,1670695000000.0,2022-12-10 18:52:00,100,,dear lord protect atiku abubakar and grant him...,en,[atikuorganizingforaction],[],...,,,,,[],,,,,dear lord protect atiku abubakar grant good he...
17,1601476707134734336,1601476707134734336,1670657000000.0,2022-12-10 08:19:35,100,,you all in invited to pdp kebbi state gubernat...,en,[],[],...,,,,,[],,,,,invited pdp kebbi state gubernatorial race fla...
18,1601468172720541696,1601468172720541696,1670655000000.0,2022-12-10 07:45:40,100,,who is your president,en,[],[],...,,,,,[],,,,,president
19,1601348388527955968,1601348388527955968,1670626000000.0,2022-12-09 23:49:42,100,,once again the eagle has landed in the capital...,en,"[atikuorganizingforaction, atikuokowa2023, ati...",[],...,,,,,[],,,,,eagle landed capital city


In [12]:
# List the languages present before dropping rows whose language is not English
tweets["language"].unique()

array(['en', 'zxx', 'ht', 'in', 'qme', 'und', 'qst', 'fr', 'tl', 'es',
       'ja', 'qam', 'it', 'et', 'pl', 'pt', 'tr', 'qht', 'art', 'de',
       'ko', 'ro', 'ur', 'hi', 'lt', 'ne', 'no', 'cy', 'ca', 'hu', 'eu',
       'nl', 'ar', 'sv', 'th', 'is', 'fi', 'el', 'ru', 'da', 'lv', 'fa',
       'zh', 'cs', 'vi'], dtype=object)

In [13]:
# Keep only rows with English tweets
tweets = tweets[tweets["language"].isin(["en"])]

In [14]:
# Confirm that only rows with English as the language remain
tweets["language"].unique()

array(['en'], dtype=object)

In [15]:
# Shape of the DataFrame with English rows only
tweets.shape

(7824, 38)

In [16]:
# Clean the 'tweet' column: strip mentions, links, hashtags, and non-letters
def clean_text(text):
    pat1 = r'@[^ ]+'                   # @mentions
    pat2 = r'https?://[A-Za-z0-9./]+'  # links
    pat3 = r'\'s'                      # possessive 's
    pat4 = r'\#\w+'                    # hashtags
    pat5 = r'&amp;?'                   # HTML-escaped ampersand
    pat6 = r'[^A-Za-z\s]'              # any remaining non-letter character
    combined_pat = r'|'.join((pat1, pat2, pat3, pat4, pat5, pat6))
    text = re.sub(combined_pat, "", text).lower()
    return text.strip()
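Because `clean_text` replaces every match with an empty string, words separated only by punctuation get merged, which is how a token like `giveawayim` appears in the output further down. The following is a sketch of one possible variant, not the notebook's method: `clean_text_spaced` is a hypothetical name, and it assumes that replacing matches with a space and then collapsing whitespace is acceptable.

```python
import re


def clean_text_spaced(text):
    """Hypothetical variant of clean_text that replaces matches with a space."""
    patterns = (
        r"@[^ ]+",        # @mentions
        r"https?://\S+",  # links
        r"'s",            # possessive 's
        r"#\w+",          # hashtags
        r"&amp;?",        # HTML-escaped ampersand
        r"[^A-Za-z\s]",   # any remaining non-letter character
    )
    text = re.sub("|".join(patterns), " ", text)
    # Collapse the runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).lower().strip()


sample = "@user Pls do giveaway,badly in need of cash &amp; work https://t.co/abc #help"
print(clean_text_spaced(sample))
```

Switching to this variant would change the de-duplicated row counts and sentiment scores reported later, so the outputs shown in this notebook correspond to the original `clean_text`.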

In [17]:
# Apply the clean_text function
tweets["tweet"] = tweets["tweet"].apply(clean_text)

In [18]:
tweets.head()

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1602814216775368706,1602233997991485440,1670976000000.0,2022-12-14 00:54:22,100,,or exam to write,en,[],[],...,,,,,,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",,,,
1,1602812769602002944,1602233997991485440,1670975000000.0,2022-12-14 00:48:37,100,,waited hours to reply smh,en,[],[],...,,,,,,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",,,,
2,1602811256402051072,1562722307646189570,1670975000000.0,2022-12-14 00:42:36,100,,these are some existing ways to get the most o...,en,[],[],...,,,,,,"[{'screen_name': 'Godfrey__K', 'name': 'Godfre...",,,,
3,1602810887705870336,1598467348566052868,1670975000000.0,2022-12-14 00:41:09,100,,nice,en,[],[],...,,,,,,[],,,,
4,1602809205399904256,1602806404942888960,1670974000000.0,2022-12-14 00:34:27,100,,pls sir do giveawayim badly in need of cash fo...,en,[],[],...,,,,,,"[{'screen_name': 'seyiaraofficial', 'name': 'O...",,,,


In [19]:
# Drop duplicated tweets, keeping the first occurrence
tweets = tweets.drop_duplicates(subset=["tweet"], keep="first")

In [20]:
# shape of DataFrame after dropping duplicates 
tweets.shape

(2981, 38)

In [21]:
# Remove stop words from the tweets and store the result in a new column
# (requires a one-time nltk.download("stopwords"))
stop = set(stopwords.words("english"))
tweets["tweet_without_stopwords"] = tweets["tweet"].apply(lambda x: " ".join(word for word in x.split() if word not in stop))

In [22]:
tweets.head()

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,tweet_without_stopwords
0,1602814216775368706,1602233997991485440,1670976000000.0,2022-12-14 00:54:22,100,,or exam to write,en,[],[],...,,,,,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",,,,,exam write
1,1602812769602002944,1602233997991485440,1670975000000.0,2022-12-14 00:48:37,100,,waited hours to reply smh,en,[],[],...,,,,,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",,,,,waited hours reply smh
2,1602811256402051072,1562722307646189570,1670975000000.0,2022-12-14 00:42:36,100,,these are some existing ways to get the most o...,en,[],[],...,,,,,"[{'screen_name': 'Godfrey__K', 'name': 'Godfre...",,,,,existing ways get house second half season coo...
3,1602810887705870336,1598467348566052868,1670975000000.0,2022-12-14 00:41:09,100,,nice,en,[],[],...,,,,,[],,,,,nice
4,1602809205399904256,1602806404942888960,1670974000000.0,2022-12-14 00:34:27,100,,pls sir do giveawayim badly in need of cash fo...,en,[],[],...,,,,,"[{'screen_name': 'seyiaraofficial', 'name': 'O...",,,,,pls sir giveawayim badly need cash work
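The word cloud task asks for the 20 most-used words once stop words are removed. As a rough cross-check outside Power BI, the counts can be sketched with the standard library; `top_words` is a hypothetical helper, and it assumes the cleaned tweets are passed as a list of whitespace-separated strings (for example `tweets["tweet_without_stopwords"].dropna().tolist()`).

```python
from collections import Counter


def top_words(texts, n=20):
    """Count whitespace-separated words across texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)


# Toy input using values from the output above
sample = ["exam write", "nice exam", "president"]
print(top_words(sample, n=1))
```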


In [23]:
# Save the DataFrame as a CSV file (to_csv returns None when given a path, so nothing is assigned)
tweets.to_csv("total_tweets.csv", index=False)

In [24]:
# Load saved csv file
df = pd.read_csv("total_tweets.csv")

### Data Cleaning

Remove columns in which at least 70% of rows are null. For columns in which fewer than 30% of rows are null, replace nulls with ‘Not Available’, or with the column mean for integer columns.

In [25]:
df.head(2)

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,tweet_without_stopwords
0,1602814216775368706,1602233997991485440,1670976000000.0,2022-12-14 00:54:22,100,,or exam to write,en,[],[],...,,,,,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",,,,,exam write
1,1602812769602002944,1602233997991485440,1670975000000.0,2022-12-14 00:48:37,100,,waited hours to reply smh,en,[],[],...,,,,,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",,,,,waited hours reply smh


In [26]:
# Get sum of null values
sum_null = df.isnull().sum()
sum_null

id                            0
conversation_id               0
created_at                    0
date                          0
timezone                      0
place                      2882
tweet                         1
language                      0
hashtags                      0
cashtags                      0
user_id                       0
user_id_str                   0
username                      0
name                          0
day                           0
hour                          0
link                          0
urls                          0
photos                        0
video                         0
thumbnail                  2386
retweet                       0
nlikes                        0
nreplies                      0
nretweets                     0
quote_url                  2632
search                        0
near                          0
geo                        2981
source                     2981
user_rt_id                 2981
user_rt 

In [27]:
# Divide by the number of rows and multiply by 100 to get the percentage of missing values
null_percentage = sum_null / len(df) * 100

In [28]:
null_percentage

id                           0.000000
conversation_id              0.000000
created_at                   0.000000
date                         0.000000
timezone                     0.000000
place                       96.678967
tweet                        0.033546
language                     0.000000
hashtags                     0.000000
cashtags                     0.000000
user_id                      0.000000
user_id_str                  0.000000
username                     0.000000
name                         0.000000
day                          0.000000
hour                         0.000000
link                         0.000000
urls                         0.000000
photos                       0.000000
video                        0.000000
thumbnail                   80.040255
retweet                      0.000000
nlikes                       0.000000
nreplies                     0.000000
nretweets                    0.000000
quote_url                   88.292519
search      

In [29]:
# Create a list of columns with at least 70% null values
atleast_70_percent_null = []
for index, value in null_percentage.items():
    if value >= 70:
        atleast_70_percent_null.append(index)


In [30]:
atleast_70_percent_null

['place',
 'thumbnail',
 'quote_url',
 'geo',
 'source',
 'user_rt_id',
 'user_rt',
 'retweet_id',
 'retweet_date',
 'translate',
 'trans_src',
 'trans_dest']
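The null-percentage steps and the loop above can also be expressed with a boolean mask. This is a sketch under the same 70% threshold; `columns_to_drop` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd


def columns_to_drop(frame, threshold=0.70):
    """Return names of columns whose share of null rows is at least threshold."""
    null_share = frame.isnull().mean()  # fraction of null rows per column
    return null_share[null_share >= threshold].index.tolist()


toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "place": [None, None, None, "Aba"],  # 75% null
    "tweet": ["a", None, "c", "d"],      # 25% null
})
print(columns_to_drop(toy))
```

`isnull().mean()` gives the same ratio as dividing `isnull().sum()` by `len(df)`, just without the factor of 100.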

In [31]:
# Drop columns with at least 70% null values
df = df.drop(columns=atleast_70_percent_null)

In [32]:
df.dtypes

id                           int64
conversation_id              int64
created_at                 float64
date                        object
timezone                     int64
tweet                       object
language                    object
hashtags                    object
cashtags                    object
user_id                      int64
user_id_str                  int64
username                    object
name                        object
day                          int64
hour                         int64
link                        object
urls                        object
photos                      object
video                        int64
retweet                       bool
nlikes                       int64
nreplies                     int64
nretweets                    int64
search                      object
near                        object
reply_to                    object
tweet_without_stopwords     object
dtype: object

In [33]:
# For columns with less than 30% nulls: replace nulls with 'Not Available',
# or with the column mean for numeric columns
def myfillna(series):
    """Fill nulls with the column mean for integer and float columns and with
    "Not Available" for string columns; return other dtypes unchanged."""
    if pd.api.types.is_bool_dtype(series):
        return series
    if pd.api.types.is_numeric_dtype(series):
        return series.fillna(series.mean())
    if pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
        return series.fillna("Not Available")
    return series
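The task statement limits filling to columns with fewer than 30% nulls, whereas `myfillna` fills every column that is left after the drop. A sketch that applies the threshold explicitly is given here; `fill_sparse_nulls` is a hypothetical helper, and treating every non-numeric column as text is an assumption.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def fill_sparse_nulls(frame, max_null_share=0.30):
    """Fill nulls only in columns whose null share is below max_null_share."""
    out = frame.copy()
    null_share = out.isnull().mean()
    for col in out.columns:
        if null_share[col] == 0 or null_share[col] >= max_null_share:
            continue  # nothing to fill, or too sparse to fill under the rule
        if is_numeric_dtype(out[col]) and not is_bool_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Assumption: all remaining columns hold text
            out[col] = out[col].fillna("Not Available")
    return out


toy = pd.DataFrame({
    "nlikes": [1.0, np.nan, 3.0, 4.0, 2.0],
    "tweet": ["a", None, "c", "d", "e"],
})
filled = fill_sparse_nulls(toy)
print(filled)
```

Columns with a null share between 30% and 70% are left untouched here because the task statement does not say how to treat them.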

In [34]:
# Applying the myfillna function
df = df.apply(myfillna)

In [35]:
# Confirm that no null values remain
df.isna().sum()

id                         0
conversation_id            0
created_at                 0
date                       0
timezone                   0
tweet                      0
language                   0
hashtags                   0
cashtags                   0
user_id                    0
user_id_str                0
username                   0
name                       0
day                        0
hour                       0
link                       0
urls                       0
photos                     0
video                      0
retweet                    0
nlikes                     0
nreplies                   0
nretweets                  0
search                     0
near                       0
reply_to                   0
tweet_without_stopwords    0
dtype: int64

In [36]:
# Extracting the Polarity and Subjectivity of each tweet
print("Running sentiment process")
# creating two new columns(polarity and subjectivity)
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

df["Subjectivity"] = df["tweet_without_stopwords"].apply(getSubjectivity)
df["Polarity"] = df["tweet_without_stopwords"].apply(getPolarity)

Running sentiment process


In [37]:
df[["Polarity", "Subjectivity", "tweet_without_stopwords"]]

Unnamed: 0,Polarity,Subjectivity,tweet_without_stopwords
0,0.000000,0.000000,exam write
1,0.000000,0.000000,waited hours reply smh
2,-0.083333,0.083333,existing ways get house second half season coo...
3,0.600000,1.000000,nice
4,-0.700000,0.666667,pls sir giveawayim badly need cash work
...,...,...,...
2976,0.000000,0.000000,favorites
2977,0.700000,0.600000,dear lord protect atiku abubakar grant good he...
2978,0.000000,0.000000,invited pdp kebbi state gubernatorial race fla...
2979,0.000000,0.000000,president


In [38]:
# Create a column that labels each tweet as positive, negative, or neutral
def analysis(score):
    if score < 0:
        return "Negative"
    elif score == 0:
        return "Neutral"
    else:
        return "Positive"
df["Analysis"] = df["Polarity"].apply(analysis)

In [39]:
df[["Polarity","Subjectivity", "tweet_without_stopwords", "Analysis"]].head()

Unnamed: 0,Polarity,Subjectivity,tweet_without_stopwords,Analysis
0,0.0,0.0,exam write,Neutral
1,0.0,0.0,waited hours reply smh,Neutral
2,-0.083333,0.083333,existing ways get house second half season coo...,Negative
3,0.6,1.0,nice,Positive
4,-0.7,0.666667,pls sir giveawayim badly need cash work,Negative


In [40]:
# Save the cleaned DataFrame to CSV
df.to_csv("clean_tweets.csv", index=False)

In [2]:
df = pd.read_csv("clean_tweets.csv")

In [3]:
df.head(2)

Unnamed: 0,id,conversation_id,created_at,date,timezone,tweet,language,hashtags,cashtags,user_id,...,nlikes,nreplies,nretweets,search,near,reply_to,tweet_without_stopwords,Subjectivity,Polarity,Analysis
0,1602814216775368706,1602233997991485440,1670976000000.0,2022-12-14 00:54:22,100,or exam to write,en,[],[],1132533571661586432,...,0,1,0,,Abagana,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",exam write,0.0,0.0,Neutral
1,1602812769602002944,1602233997991485440,1670975000000.0,2022-12-14 00:48:37,100,waited hours to reply smh,en,[],[],1132533571661586432,...,1,1,0,,Abagana,"[{'screen_name': '__babor', 'name': '🎙️𝑳𝒂𝒖𝒓𝒂 𝒕...",waited hours reply smh,0.0,0.0,Neutral
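The last task asks what percentage of users are negative at least 50% of the time. One way to check the Power BI result, sketched under the assumption that `username` identifies a user and `Analysis` holds the labels created earlier, is shown here; `share_of_mostly_negative_users` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def share_of_mostly_negative_users(frame, min_negative_share=0.5):
    """Percentage of users whose share of Negative tweets is at least min_negative_share."""
    is_negative = frame["Analysis"].eq("Negative")
    # Fraction of each user's tweets that are labelled Negative
    negative_share = is_negative.groupby(frame["username"]).mean()
    return (negative_share >= min_negative_share).mean() * 100


toy = pd.DataFrame({
    "username": ["a", "a", "b", "c", "d", "d"],
    "Analysis": ["Negative", "Positive", "Neutral", "Negative", "Positive", "Positive"],
})
pct = share_of_mostly_negative_users(toy)
print(pct)
```

In the toy data, users `a` (half negative) and `c` (all negative) meet the threshold, so two of four users qualify.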


### References
https://stackoverflow.com/questions/65902816/removing-stop-words-from-a-pandas-column