# Theoretical study (Natural Language Processing, Sentiment Analysis, Opinion Mining, processing Twitter / X messages)
### Natural Language Processing and Sentiment Analysis have a very wide range of applications.

The work would have two parts:

- Theoretical part: what NLP actually involves and the various kinds of processing applied to text; neural networks can be used for the classification step.

- Applied part: the implementation (possibly adapting examples found on the Internet) and its illustration on a data set.

While the theoretical part is largely the same regardless of domain, one challenge is obtaining a collection of messages / texts to work on.

Data set: https://www.kaggle.com/datasets/manchunhui/us-election-2020-tweets/data

### Import and removal of unneeded columns

In [4]:
import pandas as pd
import string
data_set_1 = pd.read_csv('hashtag_joebiden.csv')
# data_set_1.head()
# Keep only the tweet text; selecting by name does not depend on column order
data_set_1 = data_set_1[['tweet']]
data_set_1.head()

Unnamed: 0,tweet
0,#Elecciones2020 | En #Florida: #JoeBiden dice ...
1,#HunterBiden #HunterBidenEmails #JoeBiden #Joe...
2,@IslandGirlPRV @BradBeauregardJ @MeidasTouch T...
3,@chrislongview Watching and setting dvr. Let’s...
4,#censorship #HunterBiden #Biden #BidenEmails #...


### Converting to lowercase

In [6]:
data_set_1['tweet'] = data_set_1['tweet'].str.lower()
data_set_1.head()

Unnamed: 0,tweet
0,#elecciones2020 | en #florida: #joebiden dice ...
1,#hunterbiden #hunterbidenemails #joebiden #joe...
2,@islandgirlprv @bradbeauregardj @meidastouch t...
3,@chrislongview watching and setting dvr. let’s...
4,#censorship #hunterbiden #biden #bidenemails #...


### Removing punctuation marks and symbols

In [8]:
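def eliminate_punctuations(text):
    # Extra symbols to strip in addition to string.punctuation.
    # The typographic apostrophe ’ is not in this set, so it survives (see "let’s" below).
    extra_chars = "►'—"
    all_punctuations = string.punctuation + extra_chars
    if isinstance(text, str):
        return text.translate(str.maketrans('', '', all_punctuations))
    else:
        # Non-string values (e.g. NaN) are returned unchanged
        return text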

In [9]:
data_set_1['tweet'] = data_set_1['tweet'].apply(eliminate_punctuations)

In [10]:
data_set_1.head(20)

Unnamed: 0,tweet
0,elecciones2020 en florida joebiden dice que d...
1,hunterbiden hunterbidenemails joebiden joebide...
2,islandgirlprv bradbeauregardj meidastouch this...
3,chrislongview watching and setting dvr let’s g...
4,censorship hunterbiden biden bidenemails biden...
5,is this wrong cory bookers brilliant final que...
6,in 2020 nypost is being censorship censored by...
7,tell politicians to stick it with this free i...
8,biden httpstcoqms0pmuev5
9,proof bidens are crooked twitter will suspend...


### Removing stopwords

In [12]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def eliminate_stopwords(text):
    return " ".join([word for word in text.split() if word not in STOPWORDS])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cozma\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
data_set_1 = data_set_1.dropna(subset=['tweet'])  # drop all rows where the tweet is NaN

In [14]:
data_set_1['tweet'] = data_set_1['tweet'].apply(eliminate_stopwords)
data_set_1.head()

Unnamed: 0,tweet
0,elecciones2020 en florida joebiden dice que do...
1,hunterbiden hunterbidenemails joebiden joebide...
2,islandgirlprv bradbeauregardj meidastouch bide...
3,chrislongview watching setting dvr let’s give ...
4,censorship hunterbiden biden bidenemails biden...


In [15]:
data_set_1.to_csv('data_set_1.csv', index = False)

### Removing frequent words

In [18]:
from collections import Counter
word_count = Counter()
for text in data_set_1['tweet']:
    for word in text.split():
        word_count[word] += 1

word_count.most_common(10)

[('biden', 582840),
 ('joebiden', 345458),
 ('trump', 268541),
 ('de', 93582),
 ('election2020', 88156),
 ('vote', 79102),
 ('president', 60041),
 ('joe', 59554),
 ('la', 54694),
 ('amp', 52275)]

In [20]:
Frequent_Words = set(word for (word, wc) in word_count.most_common(3))
def eliminate_freq_words(text):
    return " ".join([word for word in text.split() if word not in Frequent_Words])

In [35]:
data_set_1['tweet'] = data_set_1['tweet'].apply(eliminate_freq_words)
data_set_1.head(10)

Unnamed: 0,tweet
0,elecciones2020 en florida dice que donaldtrump...
1,hunterbiden hunterbidenemails joebidenmuststep...
2,islandgirlprv bradbeauregardj meidastouch made...
3,chrislongview watching setting dvr let’s give ...
4,censorship hunterbiden bidenemails bidenemail ...
5,wrong cory bookers brilliant final questioning...
6,2020 nypost censorship censored twitter manipu...
7,tell politicians stick free item httpstcosua2f...
8,httpstcoqms0pmuev5
9,proof bidens crooked twitter suspend sharing h...


### Removing rare words

In [27]:
# The slice [:-10:-1] walks the list backwards and yields the 9 least frequent words
rare_words = set(word for (word, wc) in word_count.most_common()[:-10:-1])
rare_words

{'adnkronoshttpstcop8ksr3tako',
 'conoscerci',
 'emozione’',
 'giacoppolontana',
 'httpstcodechp5ypch',
 'jillbidenmia',
 'jilljacobsla',
 'ladyla',
 '‘sono'}

In [29]:
def eliminate_rare_words(text):
    return " ".join([word for word in text.split() if word not in rare_words])

In [33]:
data_set_1['tweet'] = data_set_1['tweet'].apply(eliminate_rare_words)
data_set_1.head(10)

Unnamed: 0,tweet
0,elecciones2020 en florida dice que donaldtrump...
1,hunterbiden hunterbidenemails joebidenmuststep...
2,islandgirlprv bradbeauregardj meidastouch made...
3,chrislongview watching setting dvr let’s give ...
4,censorship hunterbiden bidenemails bidenemail ...
5,wrong cory bookers brilliant final questioning...
6,2020 nypost censorship censored twitter manipu...
7,tell politicians stick free item httpstcosua2f...
8,httpstcoqms0pmuev5
9,proof bidens crooked twitter suspend sharing h...
