# **LUSIP - 8**
# ***COVID-19 Sentiment Analysis***
![Emotions Meter Picture](https://github.com/tanmayvijay/LUSIP-Sentiment-Analysis/blob/master/res/images/Sentiment%20Analysis.jpeg)


This notebook contains our code to perform Sentiment Analysis on COVID-19 Twitter Data.<br>

***Sentiment analysis*** *is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques.*


**Team Members:**
> Tanmay Vijay (Rajasthan Technical University, Kota)<br>
> Ayan Chawla (college to be added)<br>
> Balan Dhanka (college to be added)<br>





# **Data Collection**

Data collection was one of the lengthiest and most important tasks. <br>
Tweets were collected with the ***GetOldTweets3*** library and stored in multiple files, state-wise and month-wise. Three keywords (corona, COVID, COVID-19) were searched to collect tweets shared in different states. For instance, a file named ***Assam_01_tweets.csv*** contains the tweets shared in Assam during January. <br>

Months <br>
01 = January, 2020 <br>
02 = February, 2020 <br> 
03 = March, 2020 <br>
04 = April, 2020 <br>
05 = May, 2020 <br>
11 = November, 2019 <br>
12 = December, 2019 <br>
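
The naming convention above can be sketched as a small helper. This is only an illustration: `MONTH_LABELS` and `tweets_file_name` are hypothetical names that do not appear in the notebook's cells.

```python
# Hypothetical helper illustrating the "<State>_<MM>_tweets.csv" naming convention.
MONTH_LABELS = {
    "11": "November, 2019",
    "12": "December, 2019",
    "01": "January, 2020",
    "02": "February, 2020",
    "03": "March, 2020",
    "04": "April, 2020",
    "05": "May, 2020",
}


def tweets_file_name(state, month):
    """Build the CSV file name for one state and one two-digit month code."""
    if month not in MONTH_LABELS:
        raise ValueError(f"Unknown month code: {month!r}")
    return f"{state}_{month}_tweets.csv"


print(tweets_file_name("Assam", "01"))  # Assam_01_tweets.csv
```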

Comment cells and functions


In [0]:
# Installing required libraries

!pip install GetOldTweets3

In [0]:
# Imports

import time # To use sleep function
import GetOldTweets3 as got # API library to get tweets data
import pandas as pd # To create dataframes and save CSVs

In [0]:
# Lists of queries to be searched, States and their approximate radius and Date ranges to collect tweets from.

# List of various queries in text
queries_list = ['corona', 'COVID', 'COVID-19',]

# List of all States in India with their approx. radius.
states_list = [
  ['Andhra Pradesh',140], ['Arunachal Pradesh',101], ['Assam',98], ['Bihar',111], ['Chhattisgarh',129],
  ['Goa',21], ['Gujarat',155], ['Haryana',74], ['Himachal Pradesh',83], ['Jammu & Kashmir',165],
  ['Jharkhand',99], ['Karnataka',154], ['Kerala',69], ['Madhya Pradesh',195], ['Maharashtra',195],
  ['Manipur',52], ['Meghalaya',53], ['Mizoram',51], ['Nagaland',43], ['Odisha',138],
  ['Punjab',79], ['Rajasthan',205], ['Sikkim',30], ['Tamil Nadu',127], ['Telangana',117],
  ['Tripura',36], ['Uttarakhand',81],  ['Uttar Pradesh',173], ['West Bengal',104]
]

# List of Date ranges to scrape data within
dates_list = [
  ("2019-11-01", "2019-11-30"), # Nov, 2019
  ("2019-12-01", "2019-12-31"), # Dec, 2019
  ("2020-01-01", "2020-01-31"), # Jan, 2020
  ("2020-02-01", "2020-02-29"), # Feb, 2020
  ("2020-03-01", "2020-03-31"), # Mar, 2020
  ("2020-04-01", "2020-04-30"), # Apr, 2020
  ("2020-05-01", "2020-05-31"), # May, 2020
]
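
As an alternative to typing each month-end date by hand, the ranges could be derived from the calendar. This is a sketch only: `month_range` and `generated_dates` are hypothetical names, and the notebook itself uses the hand-written `dates_list` above.

```python
import calendar


def month_range(year, month):
    """Return (first_day, last_day) of a month as 'YYYY-MM-DD' strings."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    return (f"{year}-{month:02d}-01", f"{year}-{month:02d}-{last_day:02d}")


# Nov, 2019 to May, 2020, in the same order as dates_list
months = [(2019, 11), (2019, 12), (2020, 1), (2020, 2), (2020, 3), (2020, 4), (2020, 5)]
generated_dates = [month_range(year, month) for year, month in months]

print(generated_dates[3])  # ('2020-02-01', '2020-02-29')
```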

In [0]:
# Using GetOldTweets3 to collect all tweets data
# Queries created using above mentioned queries_list, states_list, dates_list

for state, within in states_list: # Loop over states
  for since_date, until_date in dates_list: # Loop over Date Ranges
    text_tweets = [] # Initialize empty list for each month

    i = 0
    while i < len(queries_list): # Loop over Query text
      query_text = queries_list[i]
      
      # Creating Query Object
      tweetCriteria = (
        got.manager.TweetCriteria()
        .setQuerySearch(query_text)
        .setMaxTweets(1000)
        .setNear(state)
        .setSince(since_date)
        .setUntil(until_date)
        .setLang("en")
        .setWithin(f"{within}mi")
      )
      
      # Getting Tweets from API
      try:
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)
      except Exception: # In case of API exceptions (a bare except would also swallow KeyboardInterrupt)
        print("Error at", i)
        time.sleep(10) # To account for API call limits
        continue

      text_tweets.extend( [tweet.text for tweet in tweets] )
      i += 1
      
    # Creating a DataFrame Object from Collected Tweets
    df = pd.DataFrame({"Tweets": text_tweets}, columns=['Tweets',]) 
    df = df.drop_duplicates().reset_index(drop=True) # Removing Duplicate Data points if any

    df.to_csv(f"{state}_{since_date[5:7]}_tweets.csv", index=False) # Saving data as a CSV, e.g. Assam_01_tweets.csv

  print(state, end=", ") # Print State that is successfully completed
print("\nDone") # Success

Andhra Pradesh, Arunachal Pradesh, Assam, Bihar, Chhattisgarh, Goa, Gujarat, Haryana, Himachal Pradesh, Jammu & Kashmir, Jharkhand, Karnataka, Kerala, Madhya Pradesh, Maharashtra, Manipur, Meghalaya, Mizoram, Nagaland, Odisha, Punjab, Rajasthan, Sikkim, Tamil Nadu, Telangana, Tripura, Uttarakhand, Uttar Pradesh, West Bengal, 
Done
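
The collection loop above retries a failed query indefinitely. A bounded retry could look like the sketch below. It is not part of the notebook: `call_with_retries` and `flaky_fetch` are hypothetical names, and the failing function is a stand-in for the real API call.

```python
import time


def call_with_retries(fetch, max_attempts=3, delay_seconds=10):
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # stand-in for API errors
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"Giving up after {max_attempts} attempts") from last_error


# Made-up stand-in for a query that fails twice and then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["tweet one", "tweet two"]


result = call_with_retries(flaky_fetch, max_attempts=5, delay_seconds=0)
print(result)
```

Capping the number of attempts means a persistent failure ends with a clear error instead of looping forever.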


These files were shared in the GitHub repository. <br>
The next task is to find the polarity of each tweet.

# **Finding Polarities**

***Polarity*** is a relationship between two opposite characteristics or tendencies, like the polarity of two sides of a debate, or of the superhero and villain in a comic book. Polarity can literally refer to a positive or negative electric charge.<br>
In NLP, Polarity reflects the Positive or Negative expression of a Tweet.<br>
#########################
Example-:(Give from any file)

To find the polarities, the ***TextBlob*** library was applied to each tweet and the result was stored in a Polarity attribute.<br>
Each tweet must be cleaned before it is passed to TextBlob because it can contain usernames, hashtags and URLs. For that, a function **clean_data** was created and each tweet was passed through it.
<br>
TextBlob returns a value in the range [-1,+1], where -1 means extreme negative polarity, +1 means extreme positive polarity and 0 means neutral. <br>
<br>
One more attribute, ***Sentiments***, is added to turn polarity into a categorical variable. <br>If polarity is in range [-1,0) then the Sentiment is *Negative*. <br>
If polarity is in range (0,+1] then the Sentiment is *Positive*. <br>
If polarity is 0 then the Sentiment is *Neutral*.
<br>
<br>
Each value was stored and new CSV files were generated State-wise and Month-wise. <br>

For instance:
<br>
The file named ***Andhra Pradesh_01_polarity.csv*** contains the tweets from January along with the Polarity (floating-point number) and Sentiment (string).
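
The three polarity ranges described above can be written as a small function. This is a minimal sketch: `categorize` is a hypothetical name, and the polarity values below are made up for illustration rather than taken from the dataset.

```python
def categorize(polarity):
    """Map a polarity in [-1, +1] to a sentiment label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Made-up polarity values, one per category
for value in (-0.6, 0.0, 0.35):
    print(value, categorize(value))
```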


In [0]:
# Installing required libraries

!pip install textblob

In [0]:
# Imports

from textblob import TextBlob
import pandas as pd
import re

In [0]:
# Lists

# List of Indian States
states_list = [
  'Andhra Pradesh', 'Arunachal Pradesh', 'Assam', 'Bihar', 'Chhattisgarh',
  'Goa', 'Gujarat', 'Haryana', 'Himachal Pradesh', 'Jammu & Kashmir',
  'Jharkhand', 'Karnataka', 'Kerala', 'Madhya Pradesh', 'Maharashtra',
  'Manipur', 'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha',
  'Punjab', 'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana',
  'Tripura', 'Uttarakhand', 'Uttar Pradesh', 'West Bengal',
]

# List of Months for which Polarities have to be found
months_list = ['11', '12', '01', '02', '03', '04', '05'] # Nov, 2019 - to - May, 2020

In [0]:
# Function to clean the tweets in the dataset
def clean_data(string):
  """
  This function removes all user-tags (@username),
  special characters (anything not within 0-9, a-z, A-Z, space( ) and tabs(\t) )
  and URL links in the tweets
  """

  # Raw string avoids invalid-escape warnings for \w and \S in newer Python versions
  return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", string).split())
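
To see what the cleaning step does, the same pattern can be run on a sample string. The tweet text below is made up for illustration, and the function is repeated only so that the snippet runs on its own.

```python
import re

# Same pattern as clean_data: user tags, special characters, URLs
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")


def clean_data(text):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(PATTERN.sub(" ", text).split())


sample = "@WHO Stay safe! #COVID19 https://t.co/abc123"  # made-up tweet
print(clean_data(sample))  # Stay safe COVID19

# Side effect worth knowing: apostrophes are removed too, so contractions split
print(clean_data("don't panic"))  # don t panic
```

Note that the hashtag symbol is dropped while the hashtag text is kept, so `#COVID19` still contributes the word `COVID19` to the cleaned tweet.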

In [0]:
# Code to find Polarities of each dataset file (Statewise and Monthwise)

for state in states_list: # To loop over states
  for month in months_list: # To loop over Months
    file_name = f"data/{state}_{month}_tweets.csv" # Input file

    # Reading Dataset
    try:
      tweets = pd.read_csv(file_name)
    except FileNotFoundError:
      # In case file is not found
      print(state, month) # Print the file details
      continue # and skip the file

    # Apply cleaning function on all tweets
    tweets['Tweets'] = tweets['Tweets'].apply(clean_data)

    # Code to find Polarities of individual tweets
    pol = [] # Empty list to store Polarities

    for index in tweets.index: # Looping over all tweets
      # Finding polarity
      polarity = TextBlob(tweets['Tweets'][index]).sentiment.polarity
      # Adding polarity to list of polarities
      pol.append(polarity)
    
    tweets['Polarity'] = pol # Creating Polarity column in the dataset


    # Code to Categorize polarities found for each tweet
    pol=[] # Empty list to store Categorical Polarities

    for index in tweets.index: # Looping over all tweets
      if tweets['Polarity'][index] > 0:   # Positive case
        pol.append("Positive")
      elif tweets["Polarity"][index] < 0: # Negative Case
        pol.append("Negative")
      else:                               # Neutral Case
        pol.append("Neutral")

    tweets['Sentiments'] = pol # Creating Sentiments column in the dataset

    # Saving the dataset statewise-monthwise
    tweets.to_csv("output/" + file_name[5:-10] + "polarity.csv", index=False)
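
The output name above is built by slicing fixed character counts off `file_name`. A sketch that derives it from the file stem instead is shown below; `polarity_file_name` is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path


def polarity_file_name(tweets_path):
    """Turn '<dir>/<State>_<MM>_tweets.csv' into '<State>_<MM>_polarity.csv'."""
    stem = Path(tweets_path).stem  # e.g. "Andhra Pradesh_01_tweets"
    suffix = "_tweets"
    if stem.endswith(suffix):
        stem = stem[: -len(suffix)]  # drop the trailing "_tweets"
    return f"{stem}_polarity.csv"


print(polarity_file_name("data/Andhra Pradesh_01_tweets.csv"))
# Andhra Pradesh_01_polarity.csv
```

This keeps working if the input directory name changes length, which the fixed slice `file_name[5:-10]` does not.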

## **Combining Data**

The monthly polarity files are combined in two steps: first into one file per state, then into a single overall dataset.

In [0]:
# Imports
import pandas as pd

In [0]:
# List of all Indian States
states_list = [
  'Andhra Pradesh', 'Arunachal Pradesh', 'Assam', 'Bihar', 'Chhattisgarh',
  'Goa', 'Gujarat', 'Haryana', 'Himachal Pradesh', 'Jammu & Kashmir',
  'Jharkhand', 'Karnataka', 'Kerala', 'Madhya Pradesh', 'Maharashtra',
  'Manipur', 'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha',
  'Punjab', 'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana',
  'Tripura', 'Uttarakhand', 'Uttar Pradesh', 'West Bengal',
]

# List of months to loop over (except 11; 11 is handled manually for initializing the 'big_df')
months_list = ['12', '01', '02', '03', '04', '05']

### ***Combining Data: Statewise***


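For each state, the monthly polarity files (November 2019 to May 2020) are read, tagged with a ***Month*** column and concatenated into a single file such as ***Assam_combined.csv***.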

In [0]:
# Code to create Statewise Datasets with Polarity and Sentiments column
for state in states_list: # Looping over all states
  big_df = pd.read_csv(f"{state}_11_polarity.csv") # Reading Nov, 2019 (11) file for each state
  big_df['Month'] = ['11',]*big_df.shape[0] # Creating Month column in the dataset

  for month in months_list: # Looping over all months
    file_name = f"{state}_{month}_polarity.csv" # File name

    df = pd.read_csv(file_name) # Reading dataset file for each month for each state
    df['Month'] = [month]*df.shape[0] # Creating Month column in the dataset

    big_df = pd.concat([big_df, df], ignore_index=True) # Concatenating both data frames
  
  big_df.to_csv(f"{state}_combined.csv", index=False) # Saving the combined file for all states
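
The idea of tagging each month's rows and stacking them can be shown without any files. The sketch below uses plain Python dictionaries instead of pandas, and both `combine_months` and the sample rows are made up for illustration.

```python
def combine_months(rows_by_month):
    """Add a 'Month' key to every row and return all rows in one list."""
    combined = []
    for month, rows in rows_by_month.items():
        for row in rows:
            combined.append({**row, "Month": month})  # copy the row, then tag it
    return combined


# Made-up rows standing in for two monthly polarity files of one state
sample = {
    "11": [{"Tweets": "example a", "Polarity": 0.2, "Sentiments": "Positive"}],
    "12": [
        {"Tweets": "example b", "Polarity": 0.0, "Sentiments": "Neutral"},
        {"Tweets": "example c", "Polarity": -0.4, "Sentiments": "Negative"},
    ],
}

combined = combine_months(sample)
print(len(combined))                      # 3
print([row["Month"] for row in combined])  # ['11', '12', '12']
```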

### ***Combining Data: Complete***


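The state-wise files are then tagged with a ***State*** column and concatenated into ***all_states_combined.csv***.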

In [0]:
# Code to combine the entire dataset into a single file
big_df = pd.read_csv("Andhra Pradesh_combined.csv") # Reading first file manually
big_df['State'] = ['Andhra Pradesh',]*big_df.shape[0] # Creating a State column

for state in states_list[1:]: # Looping over the remaining states (the first one is already loaded, so it is skipped to avoid duplicates)
  file_name = f"{state}_combined.csv" # File name

  df = pd.read_csv(file_name) # Reading each state file
  df['State'] = [state]*df.shape[0] # Creating a State column

  big_df = pd.concat([big_df, df], ignore_index=True) # Concatenating both data frames


big_df.to_csv(f"all_states_combined.csv", index=False) # Saving the combined dataset

# **Analysis**

explain everything in 3-5 paras

explain use of sentiment analysis

how we approached and all

index of types of analysis done



In [0]:
# Frequency Plot: Statewise - Monthwise

import pandas as pd
import seaborn as sns

# List of Indian States
states_list = [
  'Andhra Pradesh', 'Arunachal Pradesh', 'Assam', 'Bihar', 'Chhattisgarh',
  'Goa', 'Gujarat', 'Haryana', 'Himachal Pradesh', 'Jammu & Kashmir',
  'Jharkhand', 'Karnataka', 'Kerala', 'Madhya Pradesh', 'Maharashtra',
  'Manipur', 'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha',
  'Punjab', 'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana',
  'Tripura', 'Uttarakhand', 'Uttar Pradesh', 'West Bengal',
]

# List of Months for which Polarities have to be found
months_list = ['11', '12', '01', '02', '03', '04', '05'] # Nov, 2019 - to - May, 2020

In [0]:
df_states = []
df_months = []
df_positive = []
df_negetive = []
df_neutral = []

In [0]:
for state in states_list:
  for month in months_list:
    file_name = f"data/Tweets With Polarity/{state}_{month}_polarity.csv"

    try:
      df = pd.read_csv(file_name)
    except FileNotFoundError: # Print the missing file's details and skip it
      print(state, month)
      continue

    vc = df['Sentiments'].value_counts()
    df_states.append(state)
    df_months.append(month+"month")
    total = df.shape[0] or 1 # Avoid division by zero for empty files
    # Share of each sentiment; a label missing from the counts contributes 0
    df_positive.append(int(vc.get('Positive', 0)) / total)
    df_negetive.append(int(vc.get('Negative', 0)) / total)
    df_neutral.append(int(vc.get('Neutral', 0)) / total)

In [0]:
vc_df = pd.DataFrame({'State':df_states, 'Month': df_months, 'Positive': df_positive, 'Negative': df_negetive, 'Neutral': df_neutral},
                      columns=['State', 'Month', 'Positive', 'Negative', 'Neutral'])
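
The per-file proportions collected above amount to counting labels and dividing by the number of tweets. A minimal standard-library sketch of that calculation follows; `sentiment_shares` is a hypothetical name and the labels are made up.

```python
from collections import Counter

LABELS = ("Positive", "Negative", "Neutral")


def sentiment_shares(sentiments):
    """Return the fraction of each sentiment label in a list of labels."""
    counts = Counter(sentiments)  # missing labels count as 0
    total = len(sentiments)
    if total == 0:
        return {label: 0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


sample = ["Positive", "Neutral", "Negative", "Positive"]  # made-up labels
print(sentiment_shares(sample))
# {'Positive': 0.5, 'Negative': 0.25, 'Neutral': 0.25}
```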

# **Results**

## **Conclusions**

understandings from the analysis

how can this affect the use cases selected above

# **Acknowledgements and Citations (if any)**


