# Tweet Scraper

This is the notebook we will use to scrape the data.


## Libraries

We will use the following libraries for scraping:
* snscrape (its `snscrape.modules.twitter` module)
* pandas

In [1]:
import datetime
from datetime import timedelta

import snscrape.modules.twitter as snstwitter

import pandas as pd
import seaborn as sns

from tqdm import tqdm

from src.JATS import JATS

## Parameters



In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from JATS.src.JATS.analyzer import Analyzer

a = Analyzer()
# Build the stop word set once; membership checks on a set are fast
stop_words = set(stopwords.words('english'))

In [12]:
text = "Hi, today I hate bitcoin and all their products. Call me Mark Tyson"
print(" ".join([w for w in text.split() if w not in stop_words]))

Hi, today I hate bitcoin products. Call Mark Tyson
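
Note that the output above still contains `I`: the NLTK English stop words are lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. `STOP_WORDS` is a small hypothetical stand-in for the NLTK list and `remove_stop_words` is an illustrative helper, not part of JATS.

```python
import string

# Hypothetical stand-in for NLTK's English stop word list (all lowercase).
STOP_WORDS = {"i", "and", "all", "their", "me"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase, punctuation-stripped form is a stop word."""
    kept = []
    for token in text.split():
        # Normalize only for the comparison; the original token is preserved.
        normalized = token.strip(string.punctuation).lower()
        if normalized not in stop_words:
            kept.append(token)
    return " ".join(kept)


text = "Hi, today I hate bitcoin and all their products. Call me Mark Tyson"
print(remove_stop_words(text))
```

Whether dropping `I` helps or hurts depends on the sentiment model, so this is worth checking against `a.get_sentiment` before adopting it.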


In [7]:
%%timeit text = "Hi, today I hate bitcoin and all their products"
filtered_sentence = " ".join([w for w in text.split() if w not in stop_words])

a.get_sentiment(filtered_sentence)

65.7 µs ± 847 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [8]:
%%timeit text = "Hi, today I hate bitcoin and all their products"
a.get_sentiment(text)

60 µs ± 975 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [None]:
query = '"BTC" OR "bitcoin"'

date_from = datetime.date(2018, 4, 1)
date_until = datetime.date(2018, 4, 30)

tweet_list = JATS.get_tweets(query, date_from, date_until, verbose=True)

In [None]:
columnNames = [
    'Datetime',
    'Tweet Id',
    'Text', 
    'NumReplies',
    'NumRetweets',
    'NumLikes', 
    'IDOriginalRetweeted', 
    'Username',
    'isVerified'
]
tweet_df = pd.DataFrame(tweet_list, columns=columnNames)

### Reading existing files

In [116]:
file_name = "data/tweets/2018-03-02/2018-04-03/tweet_list.csv"
tweet_df = pd.read_csv(file_name, sep=';')

tweet_df["Datetime"] = pd.to_datetime(tweet_df["Datetime"])

FileNotFoundError: [Errno 2] No such file or directory: 'data/tweets/2018-04-01/2018-04-02/tweet_list.csv'
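
The path in the error message differs from `file_name`, which suggests the output comes from an earlier run of the cell. One way to avoid such mismatches is to build the path from the dates and check that it exists before reading. This is only a sketch: the `data/tweets/<from>/<until>/tweet_list.csv` layout is inferred from the paths in this notebook, and `tweet_csv_path` is a hypothetical helper.

```python
import datetime
from pathlib import Path


def tweet_csv_path(date_from, date_until, root="data/tweets"):
    """Build <root>/<date_from>/<date_until>/tweet_list.csv (assumed layout)."""
    return Path(root) / date_from.isoformat() / date_until.isoformat() / "tweet_list.csv"


path = tweet_csv_path(datetime.date(2018, 4, 1), datetime.date(2018, 4, 2))
if path.exists():
    print(f"Found {path}")
else:
    print(f"Missing {path}; check the dates or run the scraper first")
```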

## Sentiment Analysis

### Preparing the analysis

On the first run, we need to download a set of datasets that *nltk* relies on, such as the `stopwords` corpus used earlier (for example with `nltk.download('stopwords')`).

In [None]:
from JATS.src.JATS.analyzer import Analyzer

## Sentiment Analysis

We will drop zero-valued scores because, when the algorithm cannot determine the sentiment, it appears to assign a 0, which skews the result away from the real sentiment.

The first thing we will do is show the **mean sentiment** and a **plot of the sentiment distribution**.
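
As a rough illustration of that filtering step, the sketch below drops zero scores before taking the mean. The `Sentiment` column name and the sample values are assumptions; the column that `Analyzer.analyze` actually writes may be named differently.

```python
import pandas as pd

# Hypothetical scores; the real column name written by Analyzer may differ.
scores = pd.DataFrame({"Sentiment": [0.5, 0.0, -0.25, 0.0, 0.25]})

# Keep only rows where the analyzer produced a non-zero score.
non_zero = scores[scores["Sentiment"] != 0]

mean_sentiment = non_zero["Sentiment"].mean()
print(f"Kept {len(non_zero)} of {len(scores)} rows, mean sentiment = {mean_sentiment:.3f}")

# The distribution could then be plotted, for example with:
# non_zero["Sentiment"].plot.hist(bins=20)
```

Keep in mind that this also discards tweets that are genuinely neutral, so the mean describes only the tweets with a detectable sentiment.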

In [42]:
a = Analyzer()

In [43]:
a.analyze(tweet_df, "data/tweets/2018-04-01/2018-04-02") # Saved to a CSV

## Similarity Analysis

We need to check for similar tweets in order to catch the SPAM that appears in messages that are not exactly identical.

To do so, we will use the cosine similarity metric and then apply DBSCAN to assign those tweets to clusters.
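
To make the metric concrete: for two word-count vectors a and b, the cosine similarity is a·b / (‖a‖ ‖b‖), which is 1 for texts with the same word proportions and 0 for texts that share no words. The sketch below computes it with the standard library only. It uses simple whitespace tokenization, which is an assumption and not the tokenization that `CountVectorizer` applies.

```python
import math
from collections import Counter


def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that both texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


spam_a = "free bitcoin giveaway click here"
spam_b = "free bitcoin giveaway click now"
other = "markets closed lower today"

print(cosine_similarity_counts(spam_a, spam_b))  # near-duplicates score high
print(cosine_similarity_counts(spam_a, other))   # unrelated texts score 0
```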

In [None]:
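from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

def get_cosine_similarity(cleaned_texts):
    """Return the pairwise cosine similarity matrix of the given texts."""
    # Bag-of-words counts: one row per text, one column per vocabulary term
    vectors = CountVectorizer().fit_transform(cleaned_texts)
    # cosine_similarity accepts the sparse matrix directly, so no toarray() is needed
    return cosine_similarity(vectors)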

In [None]:

print(tweet_df.isnull().sum())
tweet_df = tweet_df.dropna(axis=0, subset=['Text'])
print(tweet_df.isnull().sum())

csim = get_cosine_similarity(tweet_df['Text'])

In [None]:
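import numpy as np
from sklearn.cluster import DBSCAN

# Note: fit(csim) treats each row of the similarity matrix as a feature vector
# and clusters the rows with DBSCAN's default Euclidean metric.
clustering = DBSCAN(eps=1.04, min_samples=1).fit(csim)
# Count how many tweets fall into each cluster label
unique_elements, counts_elements = np.unique(clustering.labels_, return_counts=True)
print(type(clustering.labels_))
print(np.asarray((unique_elements, counts_elements)))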
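
An alternative worth considering is to cluster on cosine distances directly by passing a precomputed distance matrix to DBSCAN, so that `eps` becomes a threshold on the distance between two tweets. The sketch below is illustrative only: the sample texts and `eps=0.3` are assumptions, not values tuned for this dataset.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical tweets: two near-duplicates and one unrelated message.
texts = [
    "buy bitcoin now free giveaway",
    "buy bitcoin now free giveaway today",
    "the weather is nice in madrid",
]

# Bag-of-words counts, then pairwise cosine distance (1 - cosine similarity).
counts = CountVectorizer().fit_transform(texts)
distances = cosine_distances(counts)

# metric="precomputed" tells DBSCAN that the input already holds distances.
# eps=0.3 is an illustrative threshold; min_samples=1 means no tweet is noise.
labels = DBSCAN(eps=0.3, min_samples=1, metric="precomputed").fit(distances).labels_
print(labels)
```

With this setup the two near-duplicate texts end up in the same cluster and the unrelated one gets a cluster of its own.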

In [None]:
print(csim)

In [None]:
# Attach each tweet's cluster label
tweet_df['Prediction'] = clustering.labels_.tolist()
# Show the first rows with a label below 1 (cluster 0; -1 would mark noise)
tweet_df[tweet_df['Prediction'] < 1].head()

In [12]:
import pandas as pd

column = ["DateTime","Cosa"]
df = pd.DataFrame(columns=column)

In [14]:
df.count().reset_index()

      index  0
0  DateTime  0
1      Cosa  0


In [1]:
class Foo: pass
class Bar(Foo): pass
class Bar2(Foo): pass
class Bar(Bar): pass

In [11]:
Foo.__subclasses__()

[__main__.Bar, __main__.Bar2]

In [16]:
[q.__name__ for q in Foo.__subclasses__()]


['Bar', 'Bar2']

In [20]:
[q() for q in Foo.__subclasses__() if q.__name__ == "Bar"]

[<__main__.Bar at 0x241e1500af0>]

In [46]:
import pandas as pd


d = {'col1': [1, 2, 1, 2, 4, 7, 1],
     'col2': [1, 2, 1, 1, 4, 1, 1],
     'col3': [11, 32, 41, 14, 4, 4, 18],
     'col5': ['11', '32', '41', '14', '4', '4', '18']}
df = pd.DataFrame(data=d)

In [53]:
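# numeric_only=True skips the string column col5; pandas 2.x raises a TypeError without it
df.groupby('col1', as_index=False).mean(numeric_only=True).columns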

Index(['col1', 'col2', 'col3'], dtype='object')