# Capstone Diagnostic - Benjamín Lillo

To run this notebook, download the dataset from [this link](https://www.kaggle.com/datasets/prathamsharma123/farmers-protest-tweets-dataset-raw-json) and place it in the `dataset` folder without unzipping it.
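A quick way to confirm the file is in place before running anything else, as a minimal sketch. It assumes the download keeps the name `archive.zip` (the name used by the `read_json` call later in this notebook) and that the notebook runs from the folder that contains `dataset`. The helper `dataset_is_ready` is a hypothetical name, not part of the original notebook.

```python
from pathlib import Path

# Assumed location: matches the path passed to pd.read_json later on.
DATASET_PATH = Path("dataset") / "archive.zip"


def dataset_is_ready(path: Path = DATASET_PATH) -> bool:
    """Return True if the zipped dataset exists at the expected location."""
    return path.is_file()


if not dataset_is_ready():
    print(f"Dataset not found at {DATASET_PATH}; download it and place the zip there.")
```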

The `main` function is at the end of this notebook, and all preceding cells must be run before it will work.

Next, the dataset is loaded into a pandas dataframe and the data is cleaned. The code was taken from https://www.kaggle.com/code/prathamsharma123/clean-raw-json-tweets-data

In [1]:
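import pandas as pd
from pandas import json_normalize  # top-level location since pandas 1.0; pandas.io.json is the legacy path
import warnings

# Silence all warnings to keep the notebook output clean.
warnings.filterwarnings("ignore")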

The following step can take several minutes.

In [2]:
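# Read the line-delimited JSON directly from the zip (compression is inferred from the extension).
raw_tweets = pd.read_json(r'dataset/archive.zip', lines=True)
# Keep only the tweets tagged as English.
raw_tweets = raw_tweets[raw_tweets['lang']=='en']
print("Shape: ", raw_tweets.shape)
raw_tweets.head(5)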

Shape:  (417511, 21)


Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,quoteCount,conversationId,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ShashiRajbhar6/status/1376...,2021-03-30 03:33:46+00:00,Support 👇\n\n#FarmersProtest,Support 👇\n\n#FarmersProtest,1376739399593910273,"{'username': 'ShashiRajbhar6', 'displayname': ...",[],[],0,0,...,0,1376739399593910273,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
1,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:33:23+00:00,Supporting farmers means supporting our countr...,Supporting farmers means supporting our countr...,1376739306287427584,"{'username': 'kaursuk06272818', 'displayname':...",[],[],0,0,...,0,1376739306287427584,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
2,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:31:00+00:00,Support farmers if you are related to food #St...,Support farmers if you are related to food #St...,1376738704128020488,"{'username': 'kaursuk06272818', 'displayname':...",[],[],0,0,...,0,1376738704128020488,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
3,https://twitter.com/SukhdevSingh_/status/13767...,2021-03-30 03:30:45+00:00,#StopHateAgainstFarmers support #FarmersProtes...,#StopHateAgainstFarmers support #FarmersProtes...,1376738640542400518,"{'username': 'SukhdevSingh_', 'displayname': '...",[],[],0,1,...,0,1376738640542400518,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
4,https://twitter.com/Davidmu66668113/status/137...,2021-03-30 03:30:30+00:00,"You hate farmers I hate you, \nif you love the...","You hate farmers I hate you, \nif you love the...",1376738579171344386,"{'username': 'Davidmu66668113', 'displayname':...",[],[],0,0,...,0,1376738579171344386,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,


In [3]:
# Extract the author's id from the nested `user` dictionary of each tweet.
raw_tweets['userId'] = [user['id'] for user in raw_tweets['user']]

# Keep only the columns needed for the analysis.
cols = ['url', 'date', 'renderedContent', 'id', 'userId', 'replyCount', 'retweetCount', 'likeCount', 'quoteCount', 'source', 'media', 'retweetedTweet', 'quotedTweet', 'mentionedUsers']
# Copy the selection so the edits below do not act on a view of raw_tweets.
tweets = raw_tweets[cols].copy()
tweets = tweets.rename(columns={'id':'tweetId', 'url':'tweetUrl'})
# Reduce the timestamps to calendar dates.
tweets['date'] = pd.to_datetime(tweets['date']).dt.date
tweets.head(5)

Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ShashiRajbhar6/status/1376...,2021-03-30,Support 👇\n\n#FarmersProtest,1376739399593910273,1015969769760096256,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",,,,
1,https://twitter.com/kaursuk06272818/status/137...,2021-03-30,Supporting farmers means supporting our countr...,1376739306287427584,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
2,https://twitter.com/kaursuk06272818/status/137...,2021-03-30,Support farmers if you are related to food #St...,1376738704128020488,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
3,https://twitter.com/SukhdevSingh_/status/13767...,2021-03-30,#StopHateAgainstFarmers support #FarmersProtes...,1376738640542400518,1308356658582618112,0,1,3,0,"<a href=""http://twitter.com/download/android"" ...",,,,
4,https://twitter.com/Davidmu66668113/status/137...,2021-03-30,"You hate farmers I hate you, \nif you love the...",1376738579171344386,1357311756532649985,0,0,1,0,"<a href=""http://twitter.com/download/android"" ...",,,,
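The cell above reads `id` out of each nested `user` dictionary. The sketch below shows two ways to do this on a toy dataframe: picking a single key with `apply`, or flattening every key with `pd.json_normalize`. The keys `id` and `username` appear in this notebook; the toy values and the name `toy` are made up for illustration.

```python
import pandas as pd

# Toy records shaped like the notebook's `user` column (values are invented).
toy = pd.DataFrame({
    "user": [
        {"id": 101, "username": "alice"},
        {"id": 202, "username": "bob"},
    ]
})

# Option 1: pull a single key out of each dictionary.
toy["userId"] = toy["user"].apply(lambda u: u["id"])

# Option 2: flatten every key of the dictionaries into its own column.
flat = pd.json_normalize(toy["user"].tolist())

print(toy["userId"].tolist())
print(flat.columns.tolist())
```

The second option is useful when more than one field of `user` is needed, at the cost of building a full extra dataframe.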


In [4]:
tweets = pd.DataFrame(tweets)
tweets.drop_duplicates(subset=['tweetId'], inplace=True)
print("Shape: ", tweets.shape)
tweets.head(5)

Shape:  (417511, 14)


Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ShashiRajbhar6/status/1376...,2021-03-30,Support 👇\n\n#FarmersProtest,1376739399593910273,1015969769760096256,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",,,,
1,https://twitter.com/kaursuk06272818/status/137...,2021-03-30,Supporting farmers means supporting our countr...,1376739306287427584,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
2,https://twitter.com/kaursuk06272818/status/137...,2021-03-30,Support farmers if you are related to food #St...,1376738704128020488,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
3,https://twitter.com/SukhdevSingh_/status/13767...,2021-03-30,#StopHateAgainstFarmers support #FarmersProtes...,1376738640542400518,1308356658582618112,0,1,3,0,"<a href=""http://twitter.com/download/android"" ...",,,,
4,https://twitter.com/Davidmu66668113/status/137...,2021-03-30,"You hate farmers I hate you, \nif you love the...",1376738579171344386,1357311756532649985,0,0,1,0,"<a href=""http://twitter.com/download/android"" ...",,,,


In [5]:
def tten_tweets():
  '''Return the 10 tweets with the most retweets.'''
  return tweets.nlargest(10, "retweetCount", keep="first")
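To illustrate what `nlargest` does in `tten_tweets`, here is a small sketch on invented data. The column names match the notebook; the values and the name `toy` are assumptions made for the example.

```python
import pandas as pd

# Invented tweets; only the column names come from the notebook.
toy = pd.DataFrame({
    "tweetId": [1, 2, 3, 4],
    "retweetCount": [5, 10, 7, 1],
})

# Select the 2 rows with the highest retweetCount, sorted in descending order.
top = toy.nlargest(2, "retweetCount", keep="first")
print(top)
```

With `keep="first"`, rows that tie on `retweetCount` are taken in their original order until the requested number of rows is reached.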

In [6]:
def tten_users():
  '''
  Return the 10 userIds with the most tweets, along with their tweet counts.
  '''
  return tweets['userId'].value_counts()[:10]

In [7]:
def tten_days():
  '''
  Return the 10 dates with the most tweets, along with their tweet counts.
  '''
  return tweets['date'].value_counts()[:10]
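Both `tten_users` and `tten_days` rely on `value_counts`, which returns counts sorted in descending order, so taking the first entries gives the most frequent values. A minimal sketch on invented dates (the values and the name `dates` are assumptions for illustration):

```python
import pandas as pd

# Invented dates, one entry per tweet.
dates = pd.Series([
    "2021-03-30", "2021-03-29", "2021-03-30",
    "2021-03-28", "2021-03-30", "2021-03-29",
])

# Counts come back sorted from most to least frequent.
top_days = dates.value_counts().head(2)
print(top_days)
```

`head(2)` is used here instead of a `[:2]` slice; both select the first rows by position.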

In [8]:
import re
from collections import Counter

def tten_hashtags():
  '''
  Find the hashtags across all tweets and
  return the 10 most frequent ones.
  '''
  hashtags = []
  for content in tweets["renderedContent"]:
    # Capture the word characters that follow each '#'.
    hashtags += re.findall(r"#(\w+)", content)
  counter = Counter(hashtags)
  return counter.most_common(10)
  

**Important**: Run the previous cells before using this function.

In [18]:
def main(mode):
  '''
  mode = 1: top 10 most retweeted tweets
  mode = 2: top 10 users by number of tweets posted
  mode = 3: top 10 days with the most tweets
  mode = 4: top 10 most used hashtags across all tweets
  '''
  if mode == 1:
    print("Most retweeted tweets")
    return tten_tweets()
  elif mode == 2:
    print("List of userIds with their number of tweets posted")
    return tten_users()
  elif mode == 3:
    print("Days with the most tweets, with their number of tweets")
    return tten_days()
  elif mode == 4:
    print("Most used hashtags across all tweets, with their number of occurrences")
    return tten_hashtags()
  else:
    return "mode must be between 1 and 4"

In [19]:
main(4)

Most used hashtags across all tweets, with their number of occurrences


[('FarmersProtest', 404687),
 ('IStandWithFarmers', 15713),
 ('farmersprotest', 15378),
 ('IndianFarmersHumanRights', 11934),
 ('FarmersAreIndia', 10985),
 ('StandWithFarmers', 10612),
 ('Rihanna', 9088),
 ('FarmersProtests', 8707),
 ('Farmers', 6541),
 ('shameonbollywood', 6222)]
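The result above lists `FarmersProtest` and `farmersprotest` as separate entries because the count is case-sensitive. If case variants should be merged, one option is to lowercase each hashtag before counting. This is a sketch of that variant, not the notebook's method; the function name `top_hashtags_casefold` and the sample texts are hypothetical.

```python
import re
from collections import Counter


def top_hashtags_casefold(texts, n=10):
    """Return the n most common hashtags, ignoring letter case."""
    counter = Counter()
    for text in texts:
        # Lowercase each captured hashtag so case variants share one key.
        counter.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return counter.most_common(n)


# Invented sample texts for illustration.
sample = [
    "Support #FarmersProtest",
    "#farmersprotest now",
    "#Rihanna on #FarmersProtest",
]
result = top_hashtags_casefold(sample, n=2)
print(result)
```

In the notebook this could be called as `top_hashtags_casefold(tweets["renderedContent"])`, assuming the earlier cells have been run.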