### Twitter API

Collect past and present tweets using the snscrape module.

**Sources**

*   [Microsoft - How to scrape Twitter data for sentiment analysis with Python](https://techcommunity.microsoft.com/t5/educator-developer-blog/how-to-scrape-twitter-data-for-sentiment-analysis-with-python/ba-p/3593365)
*   [GitHub repository](https://github.com/shashacode/Sentiment_Analysis/blob/main/final_tweet.ipynb)
*   [freeCodeCamp - Web Scraping with Python](https://www.freecodecamp.org/news/python-web-scraping-tutorial/)



### Installing snscrape

In [None]:
!pip install git+https://github.com/JustAnotherArchivist/snscrape.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/JustAnotherArchivist/snscrape.git
  Cloning https://github.com/JustAnotherArchivist/snscrape.git to /tmp/pip-req-build-xuqn878w
  Running command git clone -q https://github.com/JustAnotherArchivist/snscrape.git /tmp/pip-req-build-xuqn878w


In [None]:
!pip install textblob emot nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Fetching tweets

In [None]:
import snscrape.modules.twitter as sntwitter
import pandas as pd

# List that collects the text of each scraped tweet
attributes_container = []
# Build "IBM OR #IBM OR $IBM" so a tweet matching any keyword qualifies
keywords = "IBM, #IBM, $IBM".replace(",", " OR")
start = pd.to_datetime('2018-12-30').strftime("%Y-%m-%d")  # UTC
end = pd.to_datetime('today').strftime("%Y-%m-%d")
lang = "en"
list_filter = ['verified', 'blue_verified', 'trusted', 'has_engagement']
filters = " ".join(f"filter:{e}" for e in list_filter)

query = f"({keywords}) since:{start} until:{end} lang:{lang} {filters}"
n_tweets = 10  # maximum number of tweets to collect

# Use TwitterSearchScraper to fetch tweets and stop once n_tweets are collected
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
  if len(attributes_container) >= n_tweets:
    break
  attributes_container.append([tweet.rawContent])

# Create a DataFrame from the collected tweets and save it
df = pd.DataFrame(attributes_container, columns=["tweet"])
df.to_csv('tweets_scrapped.csv', index=False)
df
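
The query passed to `TwitterSearchScraper` is an ordinary search string, so it can be assembled and inspected without touching the network. The following is a minimal sketch of that step. `build_query` and its parameter names are hypothetical helpers, not part of snscrape, and the end date is a fixed example value rather than today's date.

```python
def build_query(keywords, start, end, lang, filters):
    """Assemble a search string like the one used in the cell above.

    Hypothetical helper: none of these names come from snscrape.
    """
    # Join keywords with OR so a tweet matching any of them qualifies
    keyword_part = " OR ".join(k.strip() for k in keywords)
    # Each filter becomes a "filter:<name>" term appended to the query
    filter_part = " ".join(f"filter:{name}" for name in filters)
    query = f"({keyword_part}) since:{start} until:{end} lang:{lang}"
    return f"{query} {filter_part}".strip()


example_query = build_query(
    keywords=["IBM", "#IBM", "$IBM"],
    start="2018-12-30",
    end="2023-01-01",  # example end date, chosen only for illustration
    lang="en",
    filters=["verified", "has_engagement"],
)
print(example_query)
```

Printing the string before scraping makes it easy to spot a missing `OR` or an overly restrictive combination of filters.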



In [None]:
df.shape

(100000, 1)

### Data preprocessing

These steps are needed before running the sentiment analysis:

*   Removing stop words
*   Removing tags, URL links, and other unnecessary words
*   Tokenizing the words
*   Lemmatizing the words
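
As a rough illustration of the cleaning, tokenization, and stop-word steps, here is a sketch that uses only the standard library. It is not the notebook's implementation: `simple_preprocess` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny made-up sample, whitespace splitting stands in for NLTK's tokenizer, and lemmatization is left out because it needs the WordNet corpus.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "at"}


def simple_preprocess(text):
    """Lowercase, strip links/mentions/hashtags, tokenize, drop stop words."""
    text = text.lower()
    # Remove URLs, @mentions and #hashtags
    text = re.sub(r"\w+://\S+|@\w+|#\w+", " ", text)
    # Remove punctuation and digits
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Whitespace tokenization, a stand-in for nltk's word_tokenize
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


print(simple_preprocess("The IBM cloud is growing! https://t.co/xyz @IBM #cloud 2023"))
# prints ['ibm', 'cloud', 'growing']
```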

In [None]:
import nltk
nltk.download('popular')  # run this once, then comment it out to avoid downloading again
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from emot.emo_unicode import UNICODE_EMOJI
import re
import string

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

In [None]:
eng_stop_words = list(stopwords.words('english'))
emoji = list(UNICODE_EMOJI.keys())

In [None]:
# Function that preprocesses a tweet in preparation for sentiment analysis
def ProcessedTweets(text):
    # Convert the tweet text to lowercase
    text = text.lower()
    # Remove @mentions, links, and any non-alphanumeric character followed by a space
    text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t]) |(\w+:\/\/\S+)", " ", text).split())
    # Remove remaining mentions, hashtags, and runs of digits
    text = re.sub(r'\@\w+|\#\w+|\d+', '', text)
    # Remove punctuation and digits
    punct = str.maketrans('', '', string.punctuation+string.digits)
    text = text.translate(punct)
    # Tokenize the words and drop stop words and emoji from the tweet text
    tokens = word_tokenize(text)
    filtered_words = [w for w in tokens if w not in eng_stop_words]
    filtered_words = [w for w in filtered_words if w not in emoji]
    # Lemmatize the remaining words
    lemmatizer = WordNetLemmatizer()
    lemma_words = [lemmatizer.lemmatize(w) for w in filtered_words]
    text = " ".join(lemma_words)
    return text
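
The first substitution in `ProcessedTweets` is easier to follow in isolation. The sketch below applies only that pattern, written here as a raw string, to two made-up inputs. `first_pass` is a hypothetical helper name used for illustration.

```python
import re

# Same pattern as the first substitution in ProcessedTweets, as a raw string
FIRST_PASS = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t]) |(\w+:\/\/\S+)")


def first_pass(text):
    """Apply only the first substitution and collapse whitespace."""
    return " ".join(FIRST_PASS.sub(" ", text.lower()).split())


print(first_pass("@IBM rocks! See https://t.co/abc #cloud"))  # prints: rocks see #cloud
print(first_pass("great, thanks."))                           # prints: great thanks.
```

The second alternative only matches a non-alphanumeric character that is followed by a space, which is why `#cloud` and the trailing period survive this pass. The later substitution and the punctuation translation remove them.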

In [None]:
# Create a new column called 'clean_tweet'
# by applying the ProcessedTweets function to the 'tweet' column.
df['clean_tweet'] = df['tweet'].apply(ProcessedTweets)

In [None]:
df.head(5)

| Unnamed: 0 | tweet | clean_tweet |
|---|---|---|
| 0 | SPSS Exact Tests 製品機能紹介\nhttps://t.co/hCc4BpTf... | spss exact test 製品機能紹介 |
| 1 | $SPY $QQQ $NDX $DIA $IWM $NFLX $FB $INTC $SMH ... | spy qqq ndx dia iwm nflx fb intc smh aapl nvda... |
| 2 | CVE-2022-44755 IBM Notes is susceptible to a s... | cve ibm note susceptible stack based buffer ov... |
| 3 | CVE-2022-44753 IBM Notes is susceptible to a s... | cve ibm note susceptible stack based buffer ov... |
| 4 | CVE-2022-44751 IBM Notes is susceptible to a s... | cve ibm note susceptible stack based buffer ov... |


### Sentiment analysis

To do this, a polarity score is computed with the TextBlob library, which is commonly used for NLP tasks. The polarity score, a value between -1.0 and 1.0, indicates how negative or positive the words used in the tweet are. Once the polarity is obtained, a condition maps it to a sentiment category: -1 for negative, 0 for neutral, and 1 for positive.

In [None]:
import textblob
from textblob import TextBlob

In [None]:
# Function that returns the sentiment category from the polarity score
def polarity(tweet):
    polarity_m = TextBlob(tweet).sentiment.polarity
    if polarity_m < 0:
        return -1
    elif polarity_m == 0:
        return 0
    else:
        return 1

In [None]:
# Apply the polarity function to get the sentiment category of each cleaned tweet
df['category'] = df['clean_tweet'].apply(polarity)

### Saving to CSV

In [None]:
df_clean = df.drop(columns=["tweet"])

In [None]:
df_clean.to_csv('tweets_data.csv', index=False)

In [None]:
df_clean.head()

| Unnamed: 0 | clean_tweet | category |
|---|---|---|
| 0 | spss exact test 製品機能紹介 | 1 |
| 1 | spy qqq ndx dia iwm nflx fb intc smh aapl nvda... | 1 |
| 2 | cve ibm note susceptible stack based buffer ov... | -1 |
| 3 | cve ibm note susceptible stack based buffer ov... | -1 |
| 4 | cve ibm note susceptible stack based buffer ov... | -1 |
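
To read the `category` column at a glance, the numeric values can be mapped to names and counted. This is a small sketch using only the standard library and the five category values shown above. The `LABELS` mapping is an assumed naming for illustration, not something the notebook defines.

```python
from collections import Counter

# Assumed label names for the three category values
LABELS = {-1: "negative", 0: "neutral", 1: "positive"}

# Category values of the five rows shown above
categories = [1, 1, -1, -1, -1]

# Count how many tweets fall under each label
counts = Counter(LABELS[c] for c in categories)
print(dict(counts))  # prints {'positive': 2, 'negative': 3}
```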
