# Introduction
The aim of this project is to conduct a sentiment analysis using python and twitter's API. The topic in my case is Expo 2020.

**What is Expo 2020?**
> Initiated in London 1851. It is a global gathering aimed to find solutions to challenges imposed by the current times. Aims  to create enriching and immersive experience. World expo traverses through different cities each time. It also revolves around certain themes. Currently, World expo is taking place in Dubai, UAE. Between Oct, 2021 and Mar, 2022. For more information please visit [expo2020dubai](https://www.expo2020dubai.com/)

**The goal of this project**
> As discussed above the main goal is to conduct a sentiment analysis and learn the basics of data science and big data projects. Since expo 2020 is considered an educational exhibition that revolves around modern day problems and is hosted currently in the middle east. The aim is to measure the awareness of the arab society - *By arab society we mean anyone who posts their opinions in arabic*



This part is about preprocessing the data. we will preform quality assessments and handle any data problems that is found.

In [2]:
import pandas as pd
import numpy as np
import pyarabic.araby as araby


### 1. Read the Data 
--- 
It is required to first read the data and ensure that it has ben parsed correctly. the following code block will read and store the .csv file into a data frame that we will clean up accordingly.

In [3]:
dirty_tweets = pd.read_csv('tweets_updated.csv')
display(dirty_tweets.head())
display(dirty_tweets.tail())


Unnamed: 0,index,ID,Tweet,Timestamp,Likes,Retweets,Length
0,0,1497300372569739264,قبيل الاحتفال بـ #اليوم_الدولي_للمرأة.. تستعد ...,2022-02-25 20:00:01+00:00,1,0,198
1,1,1497299670522994691,#JIMMY_4040 \nلحظات نزول دابانغ سلمان خان على ...,2022-02-25 19:57:14+00:00,2,0,207
2,2,1497297255245754371,ليلة مميزة بإنتظارنا في @expo2020dubai يوم ٢٦ ...,2022-02-25 19:47:38+00:00,0,0,232
3,3,1497295489481510914,@NancyAFashion أصدق اي حفلة احلى ولا اي فستان ...,2022-02-25 19:40:37+00:00,3,1,168
4,4,1497290589800480779,#JIMMY_4040 \nجنون ما بعده جنون دابانغ سلمان خ...,2022-02-25 19:21:09+00:00,5,2,178


Unnamed: 0,index,ID,Tweet,Timestamp,Likes,Retweets,Length
12470,12470,1509875727558189057,بعد توديع آخر ضيوف الحدث الدولي وإسدال الستار ...,2022-04-01 12:50:00+00:00,0,0,192
12471,12471,1509875571114795011,جناح السعودية يختتم فعالياته في معرض “إكسبو 20...,2022-04-01 12:49:22+00:00,0,0,109
12472,12472,1509875169782865932,#إكسبو_2020_دبي يختتم أعماله بأهم المؤتمرات ال...,2022-04-01 12:47:47+00:00,7,0,196
12473,12473,1509874863539998768,"نظمت الإدارة العامة لأمن المطارات ومبادرة ""الر...",2022-04-01 12:46:34+00:00,1,0,221
12474,12474,1509874781964980273,ختامها مسك..جناح #المملكة في #اكسبو_٢٠٢٠ #دبي ...,2022-04-01 12:46:14+00:00,1,0,169


### 1. Asses Duplicates
---
We will be assessing both duplicates and null vallues and based on our observation we will conduct the proper cleaning process
> We will find the duplicates of the tweets column only, since the other columns are gaurnteed to have duplicates and they are simply metadata. The function gets the duplicate rows without the first occurunce. we will then sum and check how much of our data is duplicated.

In [4]:
# Issue 1 - Duplicates
duplicate_tweets = dirty_tweets[dirty_tweets.duplicated(['Tweet'], keep='first')]
display(duplicate_tweets.tail())

print(f'the number of duplicates is {duplicate_tweets.shape[0]}')

Unnamed: 0,index,ID,Tweet,Timestamp,Likes,Retweets,Length
12453,12453,1509881814986412045,شُوفو زعِيم حَضنهآ كُيف خلآهآ 🇦🇪❤️\n\nاكسبو ٢٠...,2022-04-01 13:14:11+00:00,0,0,101
12454,12454,1509881803401801728,@HHShkMohd مبارك طويل العمر هذا الإنجاز الكبير...,2022-04-01 13:14:08+00:00,0,0,130
12457,12457,1509878788221513734,#الإمارات_تبتكر تكرم أفضل الابتكارات في الأجنح...,2022-04-01 13:02:09+00:00,2,0,128
12464,12464,1509877148550586373,@alkhaleej @expo2020dubai انصصصصصدمت انه بس ها...,2022-04-01 12:55:39+00:00,0,0,140
12470,12470,1509875727558189057,بعد توديع آخر ضيوف الحدث الدولي وإسدال الستار ...,2022-04-01 12:50:00+00:00,0,0,192


the number of duplicates is 4493


In [7]:
# Issue 1 - Fix
dirty_tweets = dirty_tweets.drop_duplicates(subset='Tweet', keep="first")
dirty_tweets = dirty_tweets.reset_index(drop=True)
dirty_tweets = dirty_tweets.drop(['index'], axis=1)
display(dirty_tweets.tail())
print("Length", dirty_tweets.shape[0])


Unnamed: 0,ID,Tweet,Timestamp,Likes,Retweets,Length
7977,1509875797544378374,عمان المجد والتاريخ\nوالتقدم العلمي وحصول\nالس...,2022-04-01 12:50:16+00:00,0,0,112
7978,1509875571114795011,جناح السعودية يختتم فعالياته في معرض “إكسبو 20...,2022-04-01 12:49:22+00:00,0,0,109
7979,1509875169782865932,#إكسبو_2020_دبي يختتم أعماله بأهم المؤتمرات ال...,2022-04-01 12:47:47+00:00,7,0,196
7980,1509874863539998768,"نظمت الإدارة العامة لأمن المطارات ومبادرة ""الر...",2022-04-01 12:46:34+00:00,1,0,221
7981,1509874781964980273,ختامها مسك..جناح #المملكة في #اكسبو_٢٠٢٠ #دبي ...,2022-04-01 12:46:14+00:00,1,0,169


Length 7982


In [9]:
# Issue 2 - Fix
def remove_diacritics(tweet):
    tweet = araby.strip_diacritics(tweet)
    tweet = araby.strip_tashkeel(tweet)
    tweet = araby.strip_tatweel(tweet)
    tweet = araby.normalize_alef(tweet)
    tweet = araby.normalize_hamza(tweet)
    tweet = araby.normalize_ligature(tweet)
    return tweet


 
اعمال



In [11]:

def clean_up(tweet):
    tweet = remove_diacritics(tweet)
 

In [20]:
dirty_tweets = dirty_tweets.apply(lambda x: remove_diacritics(x) if x.name == 'Tweet' else x, axis=0)
# dirty_tweets['Tweet'] = dirty_tweets['Tweet'].apply(lambda x: clean_up(x))


In [21]:
display(dirty_tweets.head(20))

Unnamed: 0,ID,Tweet,Timestamp,Likes,Retweets,Length
0,1497300372569739264,,2022-02-25 20:00:01+00:00,1,0,198
1,1497299670522994691,,2022-02-25 19:57:14+00:00,2,0,207
2,1497297255245754371,,2022-02-25 19:47:38+00:00,0,0,232
3,1497295489481510914,,2022-02-25 19:40:37+00:00,3,1,168
4,1497290589800480779,,2022-02-25 19:21:09+00:00,5,2,178
5,1497289658564284418,,2022-02-25 19:17:27+00:00,2,0,163
6,1497288641428824073,,2022-02-25 19:13:25+00:00,4,1,147
7,1497284121957060608,,2022-02-25 18:55:27+00:00,7,2,251
8,1497273383461007364,,2022-02-25 18:12:47+00:00,1,0,277
9,1497270659688407044,,2022-02-25 18:01:57+00:00,0,0,199
