- Consumer key is the API key that a service provider (Twitter, Facebook, etc.) issues to a consumer (a service that wants to access a user's resources on the service provider). This key is what identifies the consumer.

- Consumer secret is the consumer "password" that is used, along with the consumer key, to request access (i.e. authorization) to a user's resources from a service provider.

- Access token is what the service provider issues to the consumer once the consumer completes authorization. This token defines the consumer's access privileges over a particular user's resources. Each time the consumer wants to access the user's data from that service provider, it includes the access token in the API request.

- For further details, take a look at these useful slides from Google:

https://docs.google.com/presentation/d/1KqevSqe6ygWVj4U-wlarKU7-SVR79x-vjpR4gEc4A9Q/edit?pli=1#slide=id.g1697c74a_1_14
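To make the roles of the two secrets concrete, here is a minimal sketch of how a request could be signed, assuming OAuth 1.0a with the HMAC-SHA1 signature method. The URL, parameter values, and secrets are made-up placeholders, and the function names are illustrative; in practice a library such as tweepy handles this step.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # Only RFC 3986 unreserved characters are left unescaped.
    return quote(value, safe="-._~")


def signing_key(consumer_secret, token_secret=""):
    # The consumer secret and the access token secret together form the HMAC key.
    return percent_encode(consumer_secret) + "&" + percent_encode(token_secret)


def sign_request(method, url, params, consumer_secret, token_secret=""):
    # Simplified: assumes the URL is already normalized and has no query string.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join(k + "=" + v for k, v in encoded)
    base_string = "&".join([method.upper(), percent_encode(url), percent_encode(param_string)])
    key = signing_key(consumer_secret, token_secret)
    digest = hmac.new(key.encode("ascii"), base_string.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder values for illustration only.
params = {
    "oauth_consumer_key": "demo-consumer-key",   # identifies the consumer
    "oauth_token": "demo-access-token",          # identifies the granted access
    "oauth_nonce": "abc123",
    "oauth_timestamp": "0",
    "oauth_signature_method": "HMAC-SHA1",
}
signature = sign_request(
    "GET", "https://api.example.com/resource", params,
    consumer_secret="demo-consumer-secret", token_secret="demo-token-secret",
)
print(signature)
```

The key and the token travel with the request as identifiers, while the two secrets never leave the consumer: they are only used to compute the signature.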

### Scraping data from Twitter

In [90]:
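import os
import tweepy
import pandas as pd

# Read the credentials from environment variables instead of hard-coding them in the notebook.
# The variable names are illustrative; use whatever names your environment defines.
api_key=os.environ["TWITTER_API_KEY"]
api_secret=os.environ["TWITTER_API_SECRET"]
access_token=os.environ["TWITTER_ACCESS_TOKEN"]
access_secret=os.environ["TWITTER_ACCESS_SECRET"]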

auth=tweepy.OAuthHandler(api_key,api_secret)
auth.set_access_token(access_token,access_secret)
api=tweepy.API(auth)

# Most recent Modi tweets: request timeline pages 0 through 5 (up to 200 tweets per page)
d={}
ls=[]
ls_hash=[]
for i in range(6):
    tweets=api.user_timeline(screen_name='narendramodi',page=i,count=200)
    for tweet in tweets:
        ls.append(tweet.text.strip())
        # Keep the first hashtag of the tweet, or None when the tweet has no hashtag
        hashtags=tweet.entities['hashtags']
        ls_hash.append(hashtags[0]['text'] if hashtags else None)
d['Tweet_Text']=ls
d['Tweet_Hash']=ls_hash
df_tweet=pd.DataFrame(d)

### Clean the tweets

#### User Defined functions

In [126]:
import nltk
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

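# UDF for data cleaning: lowercase, tokenize, and drop English stop words
def clean_text(st):
    tt=TweetTokenizer()
    stopwords_update=set(stopwords.words('english'))
    try:
        clean_ls=[i for i in tt.tokenize(st.lower()) if i not in stopwords_update]
        return ' '.join(clean_ls)
    except AttributeError:
        # st is not a string (e.g. None or NaN): return None so the row can be dropped later
        return None

# UDF for lemmatization:
def lemma(st):
    lemmatizer=WordNetLemmatizer()
    try:
        lem_st=[lemmatizer.lemmatize(i) for i in st.split()]
        return ' '.join(lem_st)
    except AttributeError:
        # st is not a string: nothing to lemmatize
        return None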

# UDF for vectorization:
# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
def vector_Tfidf(df_col,grams_min,grams_max,max_fea):
    tf=TfidfVectorizer(ngram_range=(grams_min,grams_max),max_features=max_fea)
    tf_df=pd.DataFrame(tf.fit_transform(df_col).toarray(),columns=tf.get_feature_names_out())
    return tf_df

def vector_count(df_col,grams_min,grams_max,min_dff):
    tf=CountVectorizer(ngram_range=(grams_min,grams_max),min_df=min_dff)
    tf_df=pd.DataFrame(tf.fit_transform(df_col).toarray(),columns=tf.get_feature_names_out())
    return tf_df

# UDF for sentiment score (VADER compound score, between -1 and 1):
def sentiment(df_col):
    analyzer=SentimentIntensityAnalyzer()  # build the analyzer once, not once per row
    return [analyzer.polarity_scores(text)['compound'] for text in df_col]


In [150]:
# Clean the tweet text column; rows for which clean_text returns None are dropped
df_tweet_clean=pd.DataFrame({'Clean_Review_Text':df_tweet['Tweet_Text'].apply(clean_text)})
df_tweet_clean.dropna(inplace=True)
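# Append each tweet's first hashtag (when it has one) to its cleaned text.
# Rows are matched by index label, since dropna() may have removed some positions.
hashtags=df_tweet.loc[df_tweet_clean.index,'Tweet_Hash']
has_tag=hashtags.notna()
df_tweet_clean.loc[has_tag,'Clean_Review_Text']=df_tweet_clean.loc[has_tag,'Clean_Review_Text']+' '+hashtags[has_tag]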

- max_df removes terms that appear too frequently, also known as "corpus-specific stop words". A float is read as a proportion of documents, an integer as an absolute document count.

- For example:

    - max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".

    - max_df = 25 means "ignore terms that appear in more than 25 documents".

- The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus the default setting does not ignore any terms.

- min_df removes terms that appear too infrequently. It follows the same float/integer convention.

- For example:

    - min_df = 0.01 means "ignore terms that appear in fewer than 1% of the documents".

    - min_df = 5 means "ignore terms that appear in fewer than 5 documents".

- The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus the default setting does not ignore any terms.
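The thresholding rule described above can be sketched without scikit-learn. This is an illustrative re-implementation of the rule, not the library's code; the function name and the toy corpus are made up for the example.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Return the sorted vocabulary that survives min_df / max_df filtering."""
    n_docs = len(docs)
    # Count each term once per document it occurs in (document frequency).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    # A float is a proportion of the documents; an int is an absolute count.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = [
    "india wins the match",
    "india celebrates diwali",
    "india budget announced",
    "diwali greetings to all",
]

print(filter_by_document_frequency(docs, min_df=2))
# ['diwali', 'india']
print(filter_by_document_frequency(docs, max_df=0.5))
# ['all', 'announced', 'budget', 'celebrates', 'diwali', 'greetings', 'match', 'the', 'to', 'wins']
```

With min_df=2 only the terms found in at least two tweets remain, while max_df=0.5 drops "india" because it occurs in three of the four documents.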

### Create the DTM using CountVectorizer; Set min_df = 5

#### Refer to the URL below before using CountVectorizer
- https://stackoverflow.com/questions/57424183/how-to-force-sklearn-countvectorizer-to-not-remove-special-characters-i-e

In [294]:
# The tweets contain tokens with special characters, such as #hashtags and @mentions.
# CountVectorizer's default token_pattern selects tokens of 2 or more alphanumeric characters;
# punctuation is completely ignored and always treated as a token separator, so the special
# characters would be stripped from the feature names. A custom token_pattern keeps them.

def vector_count(df_col,grams_min,grams_max,min_dff):
    tf=CountVectorizer(ngram_range=(grams_min,grams_max),min_df=min_dff,token_pattern='[a-zA-Z0-9!#$*=?@]+')
    # get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
    tf_df=pd.DataFrame(tf.fit_transform(df_col).toarray(),columns=tf.get_feature_names_out())
    return tf_df
df_tweet_clean_DTM=vector_count(df_tweet_clean['Clean_Review_Text'],1,1,5)
# Keep only the terms longer than 5 characters
col_update=[i for i in df_tweet_clean_DTM.columns if len(i)>5]
df_tweet_clean_DTM=df_tweet_clean_DTM[col_update]
df_tweet_clean_DTM.head()

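|   | #diwali | #hunarhaat | #janjankabudget | #mannkibaat | #republicday | @bjp4india | @flotus | @gotabayar | @jairbolsonaro | @melaniatrump | ... | victory | vision | welcome | welfare | wishes | wishing | wonderful | worked | working | youngsters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |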
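The effect of the custom token_pattern can be checked with the standard `re` module, since CountVectorizer tokenizes by applying the pattern to the lowercased text. The sample sentence below is made up for illustration.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"        # CountVectorizer's default token_pattern
CUSTOM_PATTERN = r"[a-zA-Z0-9!#$*=?@]+"   # pattern passed to CountVectorizer in vector_count

text = "happy #diwali wishes from @bjp4india !"

print(re.findall(DEFAULT_PATTERN, text))
# ['happy', 'diwali', 'wishes', 'from', 'bjp4india']
print(re.findall(CUSTOM_PATTERN, text))
# ['happy', '#diwali', 'wishes', 'from', '@bjp4india', '!']
```

The custom pattern keeps the `#` and `@` prefixes, but it also lets a lone `!` through as a token, which is one reason a later filter on term length is useful.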


### Using the KMeans algorithm, cluster the tweets into 4 groups

In [295]:
import numpy as np
x=np.array(df_tweet_clean_DTM)
from sklearn.cluster import KMeans
km=KMeans(n_clusters=4,random_state=0)
y=km.fit_predict(x)
df_tweet_clean_DTM['Clusters']=y

### Top 5 words in each group

In [296]:
clust_cen=km.cluster_centers_
# argsort sorts in ascending order, so the last 5 indices of each row
# point at the 5 largest centroid weights of that cluster
top_5=np.argsort(clust_cen)[:,-5:]
col=np.array(df_tweet_clean_DTM.columns)
for i,idx in enumerate(top_5,start=1):
    print('Cluster',i,'top 5 terms:',col[idx])

Cluster 1 top 5 terms: ['elections' 'taking' 'president' 'victory' 'congratulations']
Cluster 2 top 5 terms: ['wonderful' 'congratulate' 'statehood' 'government' 'people']
Cluster 3 top 5 terms: ['delighted' 'tomorrow' 'addressing' 'greetings' 'towards']
Cluster 4 top 5 terms: ['discussions' 'attended' 'extensive' 'excellent' 'meeting']
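The terms in each list above are in ascending order of centroid weight, so the most characteristic term comes last. A dependency-free sketch of the same top-k selection, using made-up terms and centroid values, shows the ordering:

```python
def top_k_indices(row, k):
    # Pure-Python equivalent of np.argsort(row)[-k:]: indices sorted by
    # ascending value, keeping the k largest.
    order = sorted(range(len(row)), key=lambda j: row[j])
    return order[-k:]


# Hypothetical vocabulary and cluster centroids, for illustration only.
terms = ["budget", "diwali", "greetings", "meeting", "victory"]
centers = [
    [0.1, 0.9, 0.7, 0.0, 0.2],
    [0.6, 0.0, 0.1, 0.8, 0.3],
]

for cluster, row in enumerate(centers, start=1):
    idx = top_k_indices(row, 2)
    ascending = [terms[j] for j in idx]
    descending = [terms[j] for j in reversed(idx)]
    print("Cluster", cluster, "ascending:", ascending, "descending:", descending)
# Cluster 1 ascending: ['greetings', 'diwali'] descending: ['diwali', 'greetings']
# Cluster 2 ascending: ['budget', 'meeting'] descending: ['meeting', 'budget']
```

Reversing the selected indices gives the more common "most important first" presentation.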




### Top 5 hashtags in each group (only applicable to Twitter data)

In [311]:
clust_cen=km.cluster_centers_
order=np.argsort(clust_cen)  # feature indices per cluster, ascending centroid weight
col_np=np.array(df_tweet_clean_DTM.columns)
for i,idx in enumerate(order):
    # Keep only the hashtag columns, then take the 5 with the largest centroid weight
    tag_idx=[j for j in idx if col_np[j].startswith('#')][-5:]
    print('Cluster',i,'top 5 #tags in ascending order of their count:',col_np[tag_idx])

Cluster 0 top 5 #tags in ascending order of their count: ['#republicday' '#hunarhaat' '#diwali' '#janjankabudget' '#mannkibaat']
Cluster 1 top 5 #tags in ascending order of their count: ['#janjankabudget' '#republicday' '#diwali' '#mannkibaat' '#hunarhaat']
Cluster 2 top 5 #tags in ascending order of their count: ['#diwali' '#republicday' '#janjankabudget' '#hunarhaat' '#mannkibaat']
Cluster 3 top 5 #tags in ascending order of their count: ['#mannkibaat' '#republicday' '#hunarhaat' '#diwali' '#janjankabudget']
