<a class = "anchor" id = "top"></a>

# NLP- English
## Source: Twitter
---
### Authors: Gordon Amoako, Zan Sadiq
---

Table of Contents:
* [Data](#data)
* [Pre-Processing](#eda)
* [Networking](#network)
* [NLP](#nlp)
* [Conclusion](#end)
---

## Data <a class = "anchor" id = "data"></a>

In [47]:
# Import libraries
import os
import re
import string

import nltk  # needed by filter_pos below
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType, ArrayType, FloatType
# Note: min and max imported here shadow Python's built-in min() and max()
from pyspark.sql.functions import udf, col, size, lit, explode, isnan, when, count, min, max, struct
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, IndexToString, StringIndexer, VectorIndexer, CountVectorizer

In [43]:
# Function to get hashtags
def extract_hashtags(x):
    """Return the hashtags in x (words starting with '#'), without the '#'."""
    hashtag_list = []

    # Split the text into words
    for word in x.split():

        # Check the first character of every word
        if word.startswith('#'):

            # Add the word, minus the leading '#', to hashtag_list
            hashtag_list.append(word[1:])

    return hashtag_list

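# Function to process text: lowercase, then replace @mentions, URLs, and
# non-alphanumeric characters with spaces and collapse whitespace
def clean_tweet(tweet):

    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet.lower()).split())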

# Function to further process text
def more_cleaning(tweet):
    
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)  # drop @mentions
    tweet = re.sub(r"#[A-Za-z0-9_]+", "", tweet)  # drop #hashtags
    tweet = ''.join([i for i in tweet if not i.isdigit()])  # drop digits
    tweet = " ".join(re.split(r"\s+", tweet, flags = re.UNICODE))  # collapse whitespace runs
    
    return tweet

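# Function to filter by part of speech: x is a list of tokens; keep only
# tokens whose POS tag starts with 'N', 'A', or 'J' (e.g. nouns and adjectives)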
def filter_pos(x):

    x = nltk.pos_tag(x)
    x = [i[0] for i in x if i[1].startswith(('N', 'A', 'J'))]

    return x
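
The whitespace-based `extract_hashtags` above keeps any punctuation attached to a tag, so `#UCLA,` yields `UCLA,`. A minimal regex-based sketch follows, assuming a hashtag is a `#` followed by one or more word characters (`\w`). The name `extract_hashtags_regex` is hypothetical and not part of the original notebook.

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags_regex(text):
    """Return hashtags found in text, without '#' or trailing punctuation."""
    if not text:
        # Covers None and empty strings, e.g. tweets with no content
        return []
    return HASHTAG_RE.findall(text)

print(extract_hashtags_regex("News Release from #UCLA, on the #ArcStorm study."))
```

Unlike the split-based version, this one also matches a `#` that appears in the middle of a word, so it may be too permissive for some text.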

In [3]:
# Get wd
os.getcwd()

'/home/dataguy/Documents'

In [4]:
# Initialize spark
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession.builder.getOrCreate()

22/08/15 08:10:11 WARN Utils: Your hostname, computer resolves to a loopback address: 127.0.1.1; using 192.168.1.159 instead (on interface wlp0s20f3)
22/08/15 08:10:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/08/15 08:10:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
22/08/15 08:10:11 WARN Utils: Service 'SparkUI' could not bind on port 4047. Attempting port 4048.


In [5]:
df = spark.read.json('/home/dataguy/news.json')



22/08/15 08:10:25 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


                                                                                

In [6]:
df.printSchema()

root
 |-- _type: string (nullable = true)
 |-- cashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- content: string (nullable = true)
 |-- conversationId: long (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- _type: string (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: long (nullable = true)
 |-- inReplyToTweetId: long (nullable = true)
 |-- inReplyToUser: struct (nullable = true)
 |    |-- _type: string (nullable = true)
 |    |-- created: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- descriptionUrls: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |  

In [7]:
df.count()

                                                                                

2828592

In [8]:
# Show nulls
#cols = df.columns
#df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in cols]).show()

In [12]:
# Inspect
min_date, max_date = df.select(min("date"), max("date")).first()
print(f"Min Date- {min_date}")
print(f"Max Date- {max_date}")



Min Date- 2022-08-04T11:38:14+00:00
Max Date- 2022-08-14T15:51:17+00:00
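
The schema stores `date` as a string, so `min` and `max` above compare text rather than timestamps. That matches chronological order here only on the assumption that every value uses the same ISO 8601 layout and UTC offset, as the two values shown do. A small sketch that parses those two values and measures the collection window:

```python
from datetime import datetime

# Values copied from the notebook output above
min_date = "2022-08-04T11:38:14+00:00"
max_date = "2022-08-14T15:51:17+00:00"

# fromisoformat understands the '+00:00' offset and returns aware datetimes
start = datetime.fromisoformat(min_date)
end = datetime.fromisoformat(max_date)

span = end - start
print(f"Collection window: {span.days} days, {span.seconds // 3600} hours")
```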


                                                                                

[Back to top...](#top)

## Pre-Processing <a class = "anchor" id = "eda"></a>

In [16]:
df1 = (
    df.withColumn('username', col('user.username'))
      .withColumn('country', col('place.country'))
      .withColumn('country_cd', col('place.countryCode'))
      .drop('user', 'coordinates', 'place')
)

In [18]:
df1.groupBy('country').count().show()



+--------------------+-------+
|             country|  count|
+--------------------+-------+
|              Russia|     99|
|              Sweden|     93|
|     The Netherlands|    174|
|              Guyana|      4|
|            Malaysia|    150|
|           Singapore|     73|
|              Turkey|    173|
|                Iraq|      9|
|             Germany|    396|
|              France|    400|
|              Greece|    123|
|           Sri Lanka|     96|
|Republic of the P...|    312|
|              Taiwan|     49|
|                null|2781648|
|           Argentina|    122|
|             Belgium|     56|
|               Qatar|     25|
|               Ghana|    262|
|       United States|  17953|
+--------------------+-------+
only showing top 20 rows
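
Most rows have no country. As a rough worked check using only the counts printed above (the total from `df.count()`, the `null` row, and the `United States` row), the share of tweets that carry place information can be computed directly. The variable names below are illustrative.

```python
total_tweets = 2828592  # df.count()
no_country = 2781648    # rows where country is null
us_tweets = 17953       # rows where country is 'United States'

# Rows with any country are whatever is left after removing the nulls
with_country = total_tweets - no_country
pct_with_country = 100 * with_country / total_tweets
pct_us = 100 * us_tweets / total_tweets

print(f"Tweets with a country: {with_country} ({pct_with_country:.2f}%)")
print(f"US tweets: {us_tweets} ({pct_us:.2f}%)")
```

So filtering on `country` keeps only a small, geotagged slice of the data, which may not be representative of the full collection.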





In [41]:
df1.select('inReplyToUser').take(1)

[Row(inReplyToUser=Row(_type='snscrape.modules.twitter.User', created=None, description=None, descriptionUrls=None, displayname='NYT Politics', favouritesCount=None, followersCount=None, friendsCount=None, id=14434063, label=None, linkTcourl=None, linkUrl=None, listedCount=None, location=None, mediaCount=None, profileBannerUrl=None, profileImageUrl=None, protected=None, rawDescription=None, statusesCount=None, url='https://twitter.com/nytpolitics', username='nytpolitics', verified=None))]

In [42]:
df1.select('mentionedUsers').take(1)

[Row(mentionedUsers=[Row(_type='snscrape.modules.twitter.User', created=None, description=None, descriptionUrls=None, displayname='NYT Politics', favouritesCount=None, followersCount=None, friendsCount=None, id=14434063, label=None, linkTcourl=None, linkUrl=None, listedCount=None, location=None, mediaCount=None, profileBannerUrl=None, profileImageUrl=None, protected=None, rawDescription=None, statusesCount=None, url='https://twitter.com/nytpolitics', username='nytpolitics', verified=None)])]

In [70]:
df1.filter(~col('quotedTweet').isNull()).select('quotedTweet.user.username').take(1)

[Row(username='AmarUjalaNews')]

In [73]:
# Flatten the nested user fields, keep US tweets only, and convert to pandas
df2 = (
    df1.withColumn('quoted', col('quotedTweet.user.username'))
       .drop('quotedTweet')
       .withColumn('in_reply_to', col('inReplyToUser.username'))
       .withColumn('mentions', col('mentionedUsers.username'))
       .withColumn('reply_to', col('inReplyToUser.username'))  # same source column as in_reply_to
       .drop('inReplyToUser', 'mentionedUsers')
       .filter("country == 'United States'")
       .toPandas()
       .drop(['renderedContent', 'id', 'media', 'outlinks', '_type', 'cashtags',
              'conversationId', 'inReplyToTweetId', 'source', 'sourceUrl',
              'sourceLabel', 'tcooutlinks', 'url', 'country', 'lang'], axis = 1)
)

                                                                                

In [74]:
df2.head()

| | content | date | hashtags | likeCount | quoteCount | replyCount | retweetCount | retweetedTweet | username | country_cd | quoted | in_reply_to | mentions | reply_to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @VaticanNews @Synod_va a day ago Vatican news ... | 2022-08-14T15:48:01+00:00 | | 0 | 0 | 0 | 0 | | JmDV808 | US | | JmDV808 | [VaticanNews, Synod_va] | JmDV808 |
| 1 | This is great news, been looking for an excuse... | 2022-08-14T15:47:15+00:00 | | 0 | 0 | 0 | 0 | | BigCountryPhil | US | SarahTheHaider | | | |
| 2 | One thing about Ghanaian men they will marry. ... | 2022-08-14T15:46:08+00:00 | | 1 | 0 | 0 | 1 | | VickieRemoe | US | | | | |
| 3 | News Release from #UCLA on ArcStorm 2.0 study ... | 2022-08-14T15:42:50+00:00 | [UCLA] | 0 | 0 | 0 | 0 | | EpipilotIDWx | US | | | | |
| 4 | WGN is just a trash ass news station. But she ... | 2022-08-14T15:42:41+00:00 | | 1 | 0 | 0 | 0 | | joshuacharles__ | US | A_Daneshzadeh | | | |


In [67]:
df2.shape[0]

17953
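
The `quoted`, `in_reply_to`, and `mentions` columns kept above each link one user to another. A minimal sketch of turning such rows into a directed edge list is shown below, using plain dictionaries built from the first rows of `df2.head()`. The function name `build_edges` and the edge labels are hypothetical, and the sketch assumes missing values are `None` (pandas may represent them as `NaN` instead, which would need an explicit check).

```python
def build_edges(rows):
    """Return a list of (source, target, kind) tuples, one per interaction."""
    edges = []
    for row in rows:
        source = row["username"]
        if row.get("quoted"):
            edges.append((source, row["quoted"], "quote"))
        if row.get("in_reply_to"):
            edges.append((source, row["in_reply_to"], "reply"))
        # mentions is a list of usernames, or None when nobody is mentioned
        for target in row.get("mentions") or []:
            edges.append((source, target, "mention"))
    return edges

# Sample rows copied from the df2.head() output; empty cells assumed to be None
sample_rows = [
    {"username": "JmDV808", "quoted": None, "in_reply_to": "JmDV808",
     "mentions": ["VaticanNews", "Synod_va"]},
    {"username": "BigCountryPhil", "quoted": "SarahTheHaider",
     "in_reply_to": None, "mentions": None},
    {"username": "VickieRemoe", "quoted": None, "in_reply_to": None,
     "mentions": None},
]

edges = build_edges(sample_rows)
for edge in edges:
    print(edge)
```

Keeping the interaction type on each edge leaves the choice of which relations to include in a graph open.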

[Back to top...](#top)

## Networking <a class = "anchor" id = "network"></a>

[Back to top...](#top)

## NLP <a class = "anchor" id = "nlp"></a>

[Back to top...](#top)

## Conclusion <a class = "anchor" id = "end"></a>

[Back to top...](#top)