<a href="https://colab.research.google.com/github/filopacio/_python_4_analytics_nlp_project/blob/main/ebola_vs_covid_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



---
## Comparing how information about Ebola and COVID-19 spread on Twitter
---




## Install and import the required packages

In [1]:
!pip install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
#!pip install twint
!pip install nest_asyncio
!pip install transformers
import pandas as pd
import nest_asyncio
nest_asyncio.apply()
import twint 
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
from sklearn.feature_extraction.text import CountVectorizer
from transformers import pipeline
from matplotlib import pyplot as plt

Collecting twint
  Cloning https://github.com/twintproject/twint.git (to revision origin/master) to /tmp/pip-install-q5fbb8rb/twint
  Running command git clone -q https://github.com/twintproject/twint.git /tmp/pip-install-q5fbb8rb/twint
  Running command git checkout -q origin/master
Building wheels for collected packages: twint
  Building wheel for twint (setup.py) ... done
  Created wheel for twint: filename=twint-2.1.21-cp37-none-any.whl size=38872 sha256=8be3d47c83ea81c0c2f7cc65e91c485ff6782b32c088efbe2bc5af0a2bd03301
  Stored in directory: /tmp/pip-ephem-wheel-cache-ey9lo7vf/wheels/4f/3b/75/62d04b3b446658ba85401e8868d3cd1d4bc22f17ad755460a6
Successfully built twint
Installing collected packages: twint
  Found existing installation: twint 2.1.21
    Uninstalling twint-2.1.21:
      Successfully uninstalled twint-2.1.21
Successfully installed twint-2.1.21
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_

## Scrape tweets 

**Query for "covid"**


In [None]:
nest_asyncio.apply()
# Configure
c = twint.Config()
c.Search = 'covid'
c.Lang   = 'en'
c.Since  = '2020-01-01'
c.Until  = '2021-06-30'
c.Pandas = True
c.Popular_tweets = True
# Run
twint.run.Search(c)
df_c = twint.storage.panda.Tweets_df

**Query for "ebola"**

In [None]:
nest_asyncio.apply()
# Configure
e = twint.Config()
e.Search = 'ebola'
e.Lang = 'en'
e.Since = '2014-03-01'
e.Until = '2015-05-31'
e.Pandas = True
# Run
twint.run.Search(e)
df_e = twint.storage.panda.Tweets_df

## Preprocessing

I wrote the clean_text function to strip noisy characters from the tweets.

Items removed:
- links
- punctuation and special characters
- emoticons and emojis

Before doing so, I also converted all text to lower case.
I did not remove alphanumeric words, to avoid eliminating terms such as covid19 and covid-19.
                

**Text Cleaning**

In [None]:
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # remove links (http, https and bare www)
    text = re.sub("['!@#$%^&*()_+<>?:.,;]" , '', text)  # punctuations/special characters
    text = re.sub(re.compile("["                        # emoticon
        u"\U0001F600-\U0001F64F"  
        u"\U0001F300-\U0001F5FF"  
        u"\U0001F680-\U0001F6FF"  
        u"\U0001F1E0-\U0001F1FF"  
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
                           "]+", flags=re.UNICODE), '', text)
    return text
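
As a quick check of the cleaning steps, the sketch below applies the same kind of substitutions to a made-up tweet. `demo_clean` is a hypothetical, simplified stand-in for `clean_text` (it skips the emoji ranges), and the sample text and URL are invented for illustration.

```python
import re

def demo_clean(text):
    """Simplified stand-in for clean_text: lower-case, drop links and punctuation."""
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)    # remove links
    text = re.sub(r"['!@#$%^&*()_+<>?:.,;]", '', text)   # remove punctuation/special characters
    return text

# Made-up tweet, for illustration only
sample = "COVID-19 cases rising!!! Read more: https://example.com/news #covid19"
cleaned = demo_clean(sample)
print(cleaned.split())
```

Because the hyphen and the digits are not in the punctuation class, both covid-19 and covid19 survive the cleaning.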

In [None]:
# ALL THE PRE-PROCESSING FUNCTIONS COULD BE CREATED FIRST,
# AND THEN THE TEXT TRANSFORMED ALL AT ONCE AT THE END

cl_tweets = [clean_text(c) for c in df_c[df_c.language == 'en'].tweet]

words = [sentence.split() for sentence in cl_tweets]

After cleaning, each tweet is split into single words.
Therefore, 'words' is a list of lists, where each inner list holds the words of one tweet as separate strings.
The remaining pre-processing steps follow.
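
To make the list-of-lists structure concrete, here is a minimal sketch on two made-up, already-cleaned tweets. The names `demo_tweets` and `demo_words` are hypothetical and are not part of the notebook.

```python
# Two made-up, already-cleaned tweets (illustrative only)
demo_tweets = ["covid cases are rising again", "stay home stay safe"]

# One inner list of words per tweet
demo_words = [sentence.split() for sentence in demo_tweets]

print(demo_words)
print(demo_words[1][0])   # first word of the second tweet
```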

**Stopwords removal**

In [None]:
def remove_stopwords(text):
    # text: list of cleaned tweet strings; returns one list of words per tweet
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.discard(word)   # keep question words in the tweets
    text = [[w for w in tweet.split() if w not in stop] for tweet in text]
    return text
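
`remove_stopwords` keeps question words by taking them out of the stop-word set before filtering. The sketch below shows the idea without NLTK: `demo_stop` is a hypothetical, hand-written mini stop-word list standing in for NLTK's English list, and the tweets are made up.

```python
# Hypothetical mini stop-word list standing in for NLTK's English list
demo_stop = {'the', 'is', 'a', 'of', 'in', 'who', 'what', 'where', 'why'}
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']

# Keep wh-words by removing them from the stop-word set
demo_stop = demo_stop - set(wh_words)

# Made-up cleaned tweets, for illustration only
demo_tweets = ["who is tracking the spread of ebola", "where is the outbreak"]
filtered = [[w for w in tweet.split() if w not in demo_stop] for tweet in demo_tweets]
print(filtered)
```

Here 'who' and 'where' survive, while 'is', 'the' and 'of' are dropped.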

In [None]:
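removed = remove_stopwords(cl_tweets)   # pass whole tweets; the function splits them into words itself

**Lemmatization**

In [None]:
def lemmatize(text):
    # text: list of word lists; every word is lemmatized as a verb
    lem = WordNetLemmatizer()
    corpus = [[lem.lemmatize(w, pos = 'v') for w in tweet] for tweet in text]
    return corpus   # return the lemmatized words, not the untouched input

In [None]:
lemmatized = lemmatize(removed)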

**Stemming**

In [None]:
def stem(text):
   stemmer = SnowballStemmer(language = 'english')
   text = [[stemmer.stem(i) for i in i] for i in text]
   return text

**Final outcome of preprocessing**

In [None]:
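def preprocess(text):
  # text: list of raw tweet strings; returns one list of processed words per tweet
  cleaned = [clean_text(t) for t in text]
  return stem(lemmatize(remove_stopwords(cleaned)))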

In [None]:
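def word2vec(text):
  # Despite its name, this builds a bag-of-words count matrix, not Word2Vec embeddings.
  # CountVectorizer expects strings, so word lists are joined back together first.
  docs = [t if isinstance(t, str) else ' '.join(t) for t in text]
  vectorizer = CountVectorizer()
  matrix = vectorizer.fit_transform(docs)
  return matrix 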
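
`word2vec` wraps `CountVectorizer`, which yields a bag-of-words count matrix: one row per tweet, one column per vocabulary word. As a rough illustration of what such a matrix contains, the sketch below builds one in plain Python with `collections.Counter`. It is a stand-in for illustration, not scikit-learn's implementation, and `demo_docs`, `vocab` and `counts` are hypothetical names.

```python
from collections import Counter

# Made-up, already pre-processed tweets (one list of tokens per tweet)
demo_docs = [['covid', 'case', 'rise'], ['ebola', 'case', 'case']]

# Vocabulary: sorted unique tokens, one matrix column per token
vocab = sorted({tok for doc in demo_docs for tok in doc})

# Count matrix: one row per tweet, one column per vocabulary entry
counts = [[c[tok] for tok in vocab] for c in map(Counter, demo_docs)]

print(vocab)
print(counts)
```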

## Sentiment Analysis 

**Polarity of each tweet**

In [None]:
sentiment_classifier = pipeline('sentiment-analysis')

sentiment_covid = sentiment_classifier(list(df_c.tweet))
sentiment_ebola = sentiment_classifier(list(df_e.tweet))

In [None]:
df_c['sentiment'] = [sentiment_covid[i].get('label') for i in range(len(sentiment_covid))]
df_c['polarity'] =  [sentiment_covid[i].get('score') for i in range(len(sentiment_covid))]

df_e['sentiment'] = [sentiment_ebola[i].get('label') for i in range(len(sentiment_ebola))]
df_e['polarity'] =  [sentiment_ebola[i].get('score') for i in range(len(sentiment_ebola))]


df_c = df_c[df_c.language == 'en'][['date', 'tweet', 'language', 'username', 'nlikes', 'nretweets','sentiment','polarity']].reset_index(drop=True)

df_e = df_e[df_e.language == 'en'][['date', 'tweet', 'language', 'username', 'nlikes', 'nretweets','sentiment','polarity']].reset_index(drop=True)
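
The classifier returns one dict per tweet, with a 'label' and a 'score'. The score is the model's confidence in that label, so on its own it does not say whether a tweet is positive or negative. The sketch below uses made-up results in that shape and shows one possible way to fold the label into a signed value and to compute label shares. The names `demo_results`, `signed` and `label_share` are hypothetical, and the 'POSITIVE'/'NEGATIVE' labels assume the default English sentiment model.

```python
from collections import Counter

# Made-up results in the list-of-dicts shape returned by the classifier
demo_results = [
    {'label': 'NEGATIVE', 'score': 0.98},
    {'label': 'POSITIVE', 'score': 0.91},
    {'label': 'NEGATIVE', 'score': 0.75},
]

labels = [r['label'] for r in demo_results]
scores = [r['score'] for r in demo_results]

# Signed value: keep the score for POSITIVE, flip the sign for NEGATIVE
signed = [s if l == 'POSITIVE' else -s for l, s in zip(labels, scores)]

# Share of each label across the sample
label_share = {l: n / len(labels) for l, n in Counter(labels).items()}

print(signed)
print(label_share)
```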

In [35]:
def getTweets(user):
  nest_asyncio.apply()
  u = twint.Config()
  u.Username = user 
  u.Pandas = True
  twint.run.Profile(u)
  df_t = twint.storage.panda.Tweets_df
  return df_t

def getInfo(user):
  nest_asyncio.apply()
  f = twint.Config()
  f.Username= user
  f.Format = 'user {username} | tweets {tweets} | followers {followers}'
  f.Pandas = True
  twint.run.Lookup(f)
  df = twint.storage.panda.User_df
  return df

def getSentiment(user):
  df = getTweets(user)
  sentiment_classifier = pipeline('sentiment-analysis')
  sentiment_user = sentiment_classifier(list(df.tweet))   # classify the tweet texts, not the column names
  df['sentiment'] =  [sentiment_user[i].get('label') for i in range(len(sentiment_user))]
  df['polarity']  =  [sentiment_user[i].get('score') for i in range(len(sentiment_user))]
  df = df[df.language == 'en'][['date', 'tweet', 'language', 'username', 'nlikes', 'nretweets','sentiment','polarity']].reset_index(drop=True)
  return df