### About the Dataset
#### Kaggle dataset: [Download the dataset here](https://www.kaggle.com/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows/discussion/310030)

#### Description
This dataset contains 1.2M distinct tweets about the ongoing Ukraine-Russia conflict.

#### Implementation
The dataset creator ran two Jupyter notebooks around the clock, executing every 15 minutes, to monitor hashtags pertaining to the ongoing Ukraine-Russia conflict. The creator also implemented a simple "hashtag crawler" that crawled the top-most hashtags starting from an initial set of hashtags, thereby picking up the other related hashtags at a given point in time.
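
The crawler itself is not part of this notebook. As a rough illustration of the idea only, the sketch below counts which hashtags co-occur most often with a seed set. The function name `related_hashtags`, its parameters, and the sample data are all hypothetical and are not the dataset creator's code.

```python
from collections import Counter

def related_hashtags(tweet_hashtags, seed_tags, top_n=3):
    """Return the top_n hashtags that co-occur with any seed hashtag."""
    seeds = {tag.lower() for tag in seed_tags}
    counts = Counter()
    for tags in tweet_hashtags:
        tags = {tag.lower() for tag in tags}
        if tags & seeds:                 # the tweet uses at least one seed hashtag
            counts.update(tags - seeds)  # count only hashtags that are not seeds
    return [tag for tag, _ in counts.most_common(top_n)]

# Made-up hashtag lists, one list per tweet
sample = [
    ["Ukraine", "Kyiv"],
    ["Russia", "Kyiv", "NATO"],
    ["ukraine", "NATO"],
    ["Football"],
    ["Ukraine", "Kyiv"],
]
related = related_hashtags(sample, ["Ukraine", "Russia"], top_n=2)
print(related)
```

Feeding the returned hashtags back in as new seeds would approximate the "crawl" step, under the same assumptions.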

#### Import dependencies

In [1]:
import pandas as pd
import numpy as np
import re
import datetime as dt

#### Load the csv dataset into a dataframe

In [2]:
tweets_df = pd.read_csv('resources/UkraineCombinedTweetsDeduped20220227-131611.csv')



In [3]:
tweets_df.columns

Index(['Unnamed: 0', 'userid', 'username', 'acctdesc', 'location', 'following',
       'followers', 'totaltweets', 'usercreatedts', 'tweetid',
       'tweetcreatedts', 'retweetcount', 'text', 'hashtags', 'language',
       'coordinates', 'favorite_count', 'extractedts'],
      dtype='object')

#### Create a data set of English tweets

In [4]:
# Keep only English-language tweets (the ones relevant to this project)
en_tweets_df = tweets_df[tweets_df['language']=='en']
len(en_tweets_df)

723256

### Data Cleaning

In [5]:
# Count rows with no location
en_tweets_df['location'].isna().sum()

282767

In [6]:
# Drop rows where location is NaN (whitespace-only strings are not caught by notna())
en_tweets_df = en_tweets_df[en_tweets_df['location'].notna()]
len(en_tweets_df)

440489
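
The `notna()` filter only removes missing values. If blank or whitespace-only locations should go as well, one possible approach is sketched below on a made-up frame. `toy_df` and its values are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for en_tweets_df (values are made up)
toy_df = pd.DataFrame({"location": ["Kyiv", np.nan, "   ", "", "London"]})

# Strip surrounding whitespace, then treat empty strings as missing
location = toy_df["location"].str.strip()
has_location = location.notna() & (location != "")
toy_clean_df = toy_df[has_location]

print(toy_clean_df["location"].tolist())
```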

In [7]:
# Count rows with no coordinates
en_tweets_df['coordinates'].isna().sum()

440446

In [8]:
# Keep only the columns needed for the analysis; this drops userid, acctdesc, usercreatedts, language, extractedts, coordinates, and the other unused columns
tweets_df = en_tweets_df[['tweetid','username','retweetcount', 'favorite_count','location','tweetcreatedts','text']]
tweets_df.head(2)

Unnamed: 0,tweetid,username,retweetcount,favorite_count,location,tweetcreatedts,text
5,1496738676335734785,Areopagiet,1552,0,EU v2.0,2022-02-24 06:48:03.000000,A cruise missile fired by the Russian army fel...
13,1496738678332280834,Sicarius130,1032,0,Hong Kong,2022-02-24 06:48:03.000000,"SPREAD AND SHARE, YOU CAN HELP UKRAINE #Ukrain..."


In [9]:
tweets_df['location'].value_counts().head(50)

United States              7160
India                      6407
London, England            3112
USA                        2788
United Kingdom             2608
Canada                     2565
New Delhi, India           2492
London                     2369
California, USA            2337
Lagos, Nigeria             2091
Los Angeles, CA            1900
Washington, DC             1865
Earth                      1838
England, United Kingdom    1768
Ukraine                    1714
Nigeria                    1675
Texas, USA                 1582
UK                         1526
New York, USA              1482
Florida, USA               1478
Pakistan                   1405
Australia                  1386
New York, NY               1311
Mumbai, India              1303
Chicago, IL                1198
Toronto, Ontario           1155
Nairobi, Kenya             1145
Germany                    1139
Hong Kong                   966
France                      960
Europe                      936
Paris, F

In [10]:
# Create a copy so the derived country can be compared against the original location values
tweets_country_df = tweets_df.copy()

In [11]:
# Create a country column based on the top 50 values of the location column

country_dicts = [{'india':['India','New Delhi','Mumbai']},
               {'ukraine':['Kyiv','ukraine','Kharkiv','Odessa','Donetsk','Україна']},
               {'canada':['Canada','Ontario']},
               {'nigeria':['Nigeria']},
               {'pakistan':['Pakistan']},
               {'russia':['Russia','Россия']},
               {'germany':['Germany','Deutschland']},
               {'france':['France']},
               {'poland':['Poland','Polska','Warsaw','Krakow','lodz','Wroclaw','Poznan','Gdansk']},
               {'australia':['Australia', 'Sydney']},
               {'china':['China', '俄罗斯','俄羅斯']},  # note: '俄罗斯' / '俄羅斯' is "Russia" written in Chinese
               {'usa':['USA','United States','Los Angeles','Washington','Las Vegas','Chicago','New York','Houston','Seattle','Texas','Dallas','Atlanta']},
               {'uk':['UK','England','London','United Kingdom','Liverpool','Ireland']},
               {'nz':['NZ','New Zealand']}]


def find_country(location,country_dicts):   
    for country_dict in country_dicts:
        for country_name,city_list in country_dict.items():
            for city_name in city_list:
                # TODO: plain substring matching is too loose; use a regular expression (re module) with word boundaries
                if city_name.lower() in location.lower():
                    return country_name
    return np.nan

# apply function to dataframe
tweets_country_df['country'] = tweets_country_df['location'].apply(lambda x: find_country(x, country_dicts))

In [12]:
# USA, UK and NZ need a regular expression anchored on word boundaries or on the start/end of the text (^ and $).
# 'nz' currently also matches 'tanzania', 'shenzhen', 'zanzibar', 'firenze', 'mzanzi', 'denzoko'
# 'uk' currently also matches 'timbuktu', 'phuket'
# 'usa' currently also matches 'lampedusa'

tweets_country_df['country'].value_counts().head(20)

usa          52509
uk           31361
india        29683
canada       10370
nigeria       6713
germany       6491
pakistan      5733
ukraine       5099
poland        4877
australia     4429
france        4087
nz            1514
russia        1040
china          500
Name: country, dtype: int64
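
The false positives noted in the comments above come from plain substring matching. One possible fix is to compile a word-boundary regular expression per country, as sketched below. The shortened keyword table and the names `country_keywords`, `country_patterns`, and `find_country_regex` are hypothetical, and the counts shown above were not produced with this version.

```python
import re
import numpy as np

# Hypothetical, shortened keyword table for illustration only
country_keywords = {
    "ukraine": ["Ukraine", "Kyiv"],
    "usa": ["USA", "United States", "New York"],
    "uk": ["UK", "United Kingdom", "London"],
    "nz": ["NZ", "New Zealand"],
}

# One compiled pattern per country; \b keeps 'nz' from matching inside 'tanzania'
country_patterns = {
    country: re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    for country, keywords in country_keywords.items()
}

def find_country_regex(location, patterns=country_patterns):
    """Return the first country whose keyword appears as a whole word, else NaN."""
    for country, pattern in patterns.items():
        if pattern.search(location):
            return country
    return np.nan

print(find_country_regex("Auckland, NZ"))
print(find_country_regex("Dar es Salaam, Tanzania"))
```

It could be applied just like the original, for example `tweets_country_df['location'].apply(find_country_regex)`. One caveat: `\b` relies on word characters, so keywords written in scripts without spaces between words may need separate handling.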

In [13]:
# Create a new dataframe where country column is not na
tweets_country_df = tweets_country_df[tweets_country_df['country'].notna()]
tweets_country_df.shape

(164406, 8)

In [14]:
# reset index of the new df
tweets_country_df.reset_index(drop=True,inplace=True)
tweets_country_df.tail(2)

Unnamed: 0,tweetid,username,retweetcount,favorite_count,location,tweetcreatedts,text,country
164404,1497802305369825284,chinaorgcn,0,0,"Beijing, China",2022-02-27 05:14:32.000000,U.S. Secretary of State Antony #Blinken announ...,china
164405,1497802306527842307,mohitIndia143,0,0,"Rajasthan, India",2022-02-27 05:14:32.000000,NEW - #NATO Allies boost support to #Ukraine 🇺...,india


In [15]:
# Convert the timestamp column to plain date objects (drops the time of day)
tweets_country_df['tweetcreatedts'] = pd.to_datetime(tweets_country_df['tweetcreatedts']).dt.date
tweets_country_df.tail(2)

Unnamed: 0,tweetid,username,retweetcount,favorite_count,location,tweetcreatedts,text,country
164404,1497802305369825284,chinaorgcn,0,0,"Beijing, China",2022-02-27,U.S. Secretary of State Antony #Blinken announ...,china
164405,1497802306527842307,mohitIndia143,0,0,"Rajasthan, India",2022-02-27,NEW - #NATO Allies boost support to #Ukraine 🇺...,india


In [16]:
tweets_country_df.drop(['location'], axis=1, inplace=True)

In [17]:
# save the above dataframe as a csv
tweets_country_df.to_csv('resources/cleaned_data.csv', index=False)

### Tweet text cleaning function to perform the following operations:

- convert all text to lowercase
- remove mentions
- remove hashtags
- remove hyperlinks
- remove punctuation, digits, and special characters


In [18]:
def text_cleaning(df, column_name):
    """Clean the text in df[column_name] in place and return df."""
    text = df[column_name].str.lower()  # convert all text to lowercase
    text = text.str.replace(r"@[A-Za-z0-9_']+", "", regex=True)  # remove mentions
    text = text.str.replace(r"#[A-Za-z0-9_]+", "", regex=True)  # remove hashtags
    text = text.str.replace(r"http\S+|www\.\S+", "", regex=True)  # remove hyperlinks
    # replace punctuation, digits, and any other non-letter character with a space
    text = text.str.replace(r"[^\w\s]|_|\d+|[^a-zA-Z]", " ", regex=True)
    df[column_name] = text

    return df

#### Cleaned tweets df preview

In [36]:
# clean the tweets using function
cleaned_tweets_df = tweets_country_df.copy()
cleaned_tweets_df = cleaned_tweets_df[['text']]
cleaned_tweets_df = text_cleaning(cleaned_tweets_df,'text')
cleaned_tweets_df.tail(2)

Unnamed: 0,text
164404,u s secretary of state antony announced satu...
164405,new allies boost support to s...


### Data Cleaning Complete


#### NOTE: (Add to final report)

- News tweets make up about 1% of the entire dataset and are therefore assumed not to affect the analysis results.

- It is also difficult to distinguish tweets from actual news outlets from private accounts sharing news.

### VADER Sentiment Analysis

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License.

In [37]:
# Import the analyzer from the vaderSentiment package
# (already installed in the Python environment)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

In [38]:
# Generate sentiment scores for every tweet in the data set
def sentiment_scores(df, column_name):
    sentiment_score_list=[]
    for row in df[column_name]:
        vader_sentiment = analyzer.polarity_scores(row)
        sentiment_score_list.append(vader_sentiment)
    return pd.DataFrame(sentiment_score_list)
    
# Create the sentiment dataframe for the tweets using the function above
sentiments_df = sentiment_scores(cleaned_tweets_df,'text')
sentiments_df.tail(10)

Unnamed: 0,neg,neu,pos,compound
164396,0.0,0.816,0.184,0.4019
164397,0.047,0.862,0.091,0.4588
164398,0.0,1.0,0.0,0.0
164399,0.18,0.599,0.221,0.296
164400,0.242,0.758,0.0,-0.4939
164401,0.0,1.0,0.0,0.0
164402,0.0,0.743,0.257,0.6369
164403,0.27,0.73,0.0,-0.743
164404,0.072,0.718,0.21,0.5574
164405,0.0,0.859,0.141,0.6597


In [39]:
# Concatenate ALL tweet data with sentiment scores into a new dataframe
tweet_sentiments_df = pd.concat([tweets_country_df,sentiments_df],axis=1)
cleaned_tweets_df.rename(columns = {'text':'cleaned_text'}, inplace = True)
tweet_sentiments_df = pd.concat([tweet_sentiments_df,cleaned_tweets_df],axis=1)
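
The column-wise `pd.concat` above aligns rows by index label, so it depends on the earlier `reset_index(drop=True)` call. The sketch below, on made-up frames, shows what can happen when the labels do not line up. `left` and `right` are hypothetical stand-ins for a filtered frame and a freshly built score frame.

```python
import pandas as pd

# A filtered frame keeps its old index labels; a new frame starts at 0
left = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 11, 12])
right = pd.DataFrame({"compound": [0.1, 0.2, 0.3]})

# Labels do not overlap, so concat produces the union of labels, padded with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first lines the rows up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```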

In [40]:
#tweet_sentiments_df = pd.concat([cleaned_tweets_df.reset_index(drop=True),sentiments_df],axis=1)
tweet_sentiments_df.head(10)

Unnamed: 0,tweetid,username,retweetcount,favorite_count,tweetcreatedts,text,country,neg,neu,pos,compound,cleaned_text
0,1496738679997542403,EuromaidanPR,28,109,2022-02-24,The world must act immediately.- #Ukraine is a...,ukraine,0.226,0.73,0.044,-0.8173,the world must act immediately is at stake ...
1,1496738680203055109,bilalasghar778,29,0,2022-02-24,The historic moment when the PM of Pakistan Mr...,pakistan,0.0,0.932,0.068,0.4019,the historic moment when the pm of pakistan mr...
2,1496738689451380736,gon_deedee,2,0,2022-02-24,Real #Americans stand #UnitedWithBiden AGAINST...,usa,0.0,1.0,0.0,0.0,real stand against against
3,1496738693675028484,zohrathought,8,0,2022-02-24,"Voices from #Russia: ""Waking up to the news, m...",nz,0.318,0.682,0.0,-0.9562,voices from waking up to the news many rus...
4,1496738694463643652,tyfacts12,119,0,2022-02-24,BREAKING: Over 800 Ukrainian military casualti...,usa,0.0,1.0,0.0,0.0,breaking over ukrainian military casualties...
5,1496738695223083012,MHW_PR,0,0,2022-02-24,"Despite all the threats and warnings, the trag...",uk,0.263,0.682,0.055,-0.8346,despite all the threats and warnings the trag...
6,1496738695554158598,vikram29121971,82,0,2022-02-24,Joe Biden preparing himself to go to War with ...,india,0.415,0.585,0.0,-0.8316,joe biden preparing himself to go to war with ...
7,1496738696028119043,highasapple,146,0,2022-02-24,"RT, SPREAD AND SHARE, YOU CAN HELP UKRAINE #Uk...",poland,0.0,0.55,0.45,0.5994,rt spread and share you can help ukraine
8,1496738696955060225,asitmitt,1268,0,2022-02-24,I strongly condemn #Russia’s reckless attack o...,india,0.374,0.542,0.084,-0.926,i strongly condemn s reckless attack on whi...
9,1496738697475145728,minchinswitchy,141,0,2022-02-24,BREAKING: Ukrainian journalist Ian Pound was ...,australia,0.169,0.732,0.099,-0.3612,breaking ukrainian journalist ian pound was ...


In [41]:
# convert scores into positive, neutral, negative 

# create a list of conditions
conditions = [
              (tweet_sentiments_df['compound'] < 0),
              (tweet_sentiments_df['compound'] == 0),
              (tweet_sentiments_df['compound'] > 0)
              ]

# create a list of values corresponding with each condition
values = ['negative','neutral','positive']


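# A string default keeps the dtype consistent; the integer default (0) can raise a TypeError on recent NumPy versions
tweet_sentiments_df['sentiment'] = np.select(conditions, values, default='unknown')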
tweet_sentiments_df.head()

Unnamed: 0,tweetid,username,retweetcount,favorite_count,tweetcreatedts,text,country,neg,neu,pos,compound,cleaned_text,sentiment
0,1496738679997542403,EuromaidanPR,28,109,2022-02-24,The world must act immediately.- #Ukraine is a...,ukraine,0.226,0.73,0.044,-0.8173,the world must act immediately is at stake ...,negative
1,1496738680203055109,bilalasghar778,29,0,2022-02-24,The historic moment when the PM of Pakistan Mr...,pakistan,0.0,0.932,0.068,0.4019,the historic moment when the pm of pakistan mr...,positive
2,1496738689451380736,gon_deedee,2,0,2022-02-24,Real #Americans stand #UnitedWithBiden AGAINST...,usa,0.0,1.0,0.0,0.0,real stand against against,neutral
3,1496738693675028484,zohrathought,8,0,2022-02-24,"Voices from #Russia: ""Waking up to the news, m...",nz,0.318,0.682,0.0,-0.9562,voices from waking up to the news many rus...,negative
4,1496738694463643652,tyfacts12,119,0,2022-02-24,BREAKING: Over 800 Ukrainian military casualti...,usa,0.0,1.0,0.0,0.0,breaking over ukrainian military casualties...,neutral


In [42]:
tweet_sentiments_df['sentiment'].value_counts()

negative    70316
positive    63842
neutral     30248
Name: sentiment, dtype: int64
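
The labels above treat only a compound score of exactly 0 as neutral. The VADER documentation suggests a wider neutral band, with scores of at least 0.05 counted as positive and scores of at most -0.05 as negative. The sketch below applies those thresholds to made-up scores. The `scores` frame is an assumption for illustration, and the counts above were not produced this way.

```python
import numpy as np
import pandas as pd

# Made-up compound scores, for illustration only
scores = pd.DataFrame({"compound": [-0.8173, -0.03, 0.0, 0.04, 0.4019]})

conditions = [
    scores["compound"] <= -0.05,  # clearly negative
    scores["compound"] >= 0.05,   # clearly positive
]
# Everything in between falls through to the 'neutral' default
scores["sentiment"] = np.select(conditions, ["negative", "positive"], default="neutral")

print(scores["sentiment"].tolist())
```

Using the band would move weakly scored tweets into the neutral class, so the class counts would differ from those shown above.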

### Save the tweet_sentiments_df dataframe as a SQLite database

In [46]:
import sqlite3
conn = sqlite3.connect('resources/tweets_data.sqlite')
tweet_sentiments_df.to_sql(name='tweets_data', con=conn, if_exists='replace', index=False)
conn.close()
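
A quick way to confirm that a write like this worked is to query the table back. The sketch below does that against an in-memory database with a made-up two-row frame. `toy_df` and the query are assumptions for illustration, and the table name simply mirrors the one used above.

```python
import sqlite3
import pandas as pd

# Made-up rows standing in for tweet_sentiments_df
toy_df = pd.DataFrame({"tweetid": [1, 2], "sentiment": ["negative", "positive"]})

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
toy_df.to_sql(name="tweets_data", con=conn, if_exists="replace", index=False)

# Read the data back with a simple aggregate query
counts = pd.read_sql_query(
    "SELECT sentiment, COUNT(*) AS n FROM tweets_data "
    "GROUP BY sentiment ORDER BY sentiment",
    conn,
)
conn.close()

print(counts)
```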

### Save the tweet_sentiments_df dataframe as a CSV file

In [47]:
tweet_sentiments_df.to_csv('resources/tweets_data.csv', index=False)

### Save the dataframe as a JSON file for the website

In [49]:
tweet_sentiments_df.to_json('resources/tweets_data.json')
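
By default, `to_json` writes a column-oriented object keyed by column name and then by row label. If the website expects a list of row objects instead, `orient="records"` is one option. The sketch below compares the two layouts on a made-up frame. `toy_df` is an assumption for illustration, and which layout the site needs is not stated in this notebook.

```python
import json
import pandas as pd

# Made-up rows standing in for tweet_sentiments_df
toy_df = pd.DataFrame({"country": ["usa", "uk"], "compound": [0.4019, -0.8173]})

# Default layout: {column -> {row label -> value}}
default_json = json.loads(toy_df.to_json())

# Records layout: one object per row, convenient for iterating in JavaScript
records_json = json.loads(toy_df.to_json(orient="records"))

print(default_json["country"])
print(records_json[0])
```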