# Import libraries

In [1]:
import json
from datetime import datetime
import pandas as pd

---

# Open file

A quick look at the data file shows that it consists of a collection of JSON objects, one per line, as shown in the image below.

![img](../img/data_file.png "screenshot data file")

In total, the file contains 117407 lines, each holding one JSON object.

## Get data

To work with the data, we first have to extract it from the JSON file. To do that, we'll run the following lines of code.

In [2]:
file_path = "farmers-protest-tweets-2021-2-4.json"
# file_path = "test.json"
with open(file_path, 'r') as file:
    data_list = []
    for line in file:
        object_json = json.loads(line)
        data_list.append(object_json)

Now all the data is in the `data_list` list. Each dictionary in the original JSON file corresponds to one position in the list.
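
If only a few keys are needed later, the same line-by-line reading can drop the rest while parsing. The sketch below is one possible way to do that, not the notebook's method: `iter_records` is a hypothetical helper, and the two sample lines are made up to stand in for the real file.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_records(lines: Iterable[str], fields: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one trimmed dict per non-empty JSON line (hypothetical helper)."""
    wanted = tuple(fields)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        record = json.loads(line)
        # Keep only the requested keys; a missing key becomes None.
        yield {key: record.get(key) for key in wanted}


# Two made-up lines standing in for the real file.
sample_lines = [
    '{"date": "2021-02-24T09:23:35+00:00", "user": {"username": "alice"}, "lang": "en"}',
    '{"date": "2021-02-24T09:23:32+00:00", "user": {"username": "bob"}, "lang": "pa"}',
]
trimmed = list(iter_records(sample_lines, ["date", "user"]))
print(trimmed[0])
```

An open file object can be passed instead of `sample_lines`, since iterating over a file also yields one line at a time.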

From the first element of the list we can extract the following information.

In [3]:
print("Amount of data lines: ", len(data_list))
first = data_list[0]
print("First object keys: ", first.keys())
print("First object keys length: ", len(first.keys()))
print("First object: ", json.dumps(first, indent=4))


Amount of data lines:  117407
First object keys:  dict_keys(['url', 'date', 'content', 'renderedContent', 'id', 'user', 'outlinks', 'tcooutlinks', 'replyCount', 'retweetCount', 'likeCount', 'quoteCount', 'conversationId', 'lang', 'source', 'sourceUrl', 'sourceLabel', 'media', 'retweetedTweet', 'quotedTweet', 'mentionedUsers'])
First object keys length:  21
First object:  {
    "url": "https://twitter.com/ArjunSinghPanam/status/1364506249291784198",
    "date": "2021-02-24T09:23:35+00:00",
    "content": "The world progresses while the Indian police and Govt are still trying to take India back to the horrific past through its tyranny. \n\n@narendramodi @DelhiPolice Shame on you. \n\n#ModiDontSellFarmers \n#FarmersProtest \n#FreeNodeepKaur https://t.co/es3kn0IQAF",
    "renderedContent": "The world progresses while the Indian police and Govt are still trying to take India back to the horrific past through its tyranny. \n\n@narendramodi @DelhiPolice Shame on you. \n\n#ModiDontSellFarmer

The `Amount of data lines` output confirms that all 117407 lines were extracted from the file. The first dictionary has 21 keys.

To get a better look at all the data in aggregated form, we'll create a dataframe from the list using the pandas module.

In [4]:
df = pd.DataFrame(data_list)
df.head(10)

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,quoteCount,conversationId,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ArjunSinghPanam/status/136...,2021-02-24T09:23:35+00:00,The world progresses while the Indian police a...,The world progresses while the Indian police a...,1364506249291784198,"{'username': 'ArjunSinghPanam', 'displayname':...",[https://twitter.com/ravisinghka/status/136415...,[https://t.co/es3kn0IQAF],0,0,...,0,1364506249291784198,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,{'url': 'https://twitter.com/RaviSinghKA/statu...,"[{'username': 'narendramodi', 'displayname': '..."
1,https://twitter.com/PrdeepNain/status/13645062...,2021-02-24T09:23:32+00:00,#FarmersProtest \n#ModiIgnoringFarmersDeaths \...,#FarmersProtest \n#ModiIgnoringFarmersDeaths \...,1364506237451313155,"{'username': 'PrdeepNain', 'displayname': 'Pra...",[],[],0,0,...,0,1364506237451313155,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'Kisanektamorcha', 'displayname'..."
2,https://twitter.com/parmarmaninder/status/1364...,2021-02-24T09:23:22+00:00,ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾਂ ਨੂੰ ਮੱਦੇਨਜ਼ਰ ਰੱਖਦੇ ਹੋਏ \nਮੇ...,ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾਂ ਨੂੰ ਮੱਦੇਨਜ਼ਰ ਰੱਖਦੇ ਹੋਏ \nਮੇ...,1364506195453767680,"{'username': 'parmarmaninder', 'displayname': ...",[],[],0,0,...,0,1364506195453767680,pa,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
3,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24T09:23:16+00:00,@ReallySwara @rohini_sgh watch full video here...,@ReallySwara @rohini_sgh watch full video here...,1364506167226032128,"{'username': 'anmoldhaliwal', 'displayname': '...",[https://youtu.be/-bUKumwq-J8],[https://t.co/wBPNdJdB0n],0,0,...,0,1364350947099484160,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'ReallySwara', 'displayname': 'S..."
4,https://twitter.com/KotiaPreet/status/13645061...,2021-02-24T09:23:10+00:00,#KisanEktaMorcha #FarmersProtest #NoFarmersNoF...,#KisanEktaMorcha #FarmersProtest #NoFarmersNoF...,1364506144002088963,"{'username': 'KotiaPreet', 'displayname': 'Pre...",[],[],0,0,...,0,1364506144002088963,und,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
5,https://twitter.com/babli_708/status/136450612...,2021-02-24T09:23:05+00:00,Jai jwaan jai kissan #FarmersProtest #ModiIgno...,Jai jwaan jai kissan #FarmersProtest #ModiIgno...,1364506120497360896,"{'username': 'babli_708', 'displayname': 'Babl...",[https://twitter.com/rajeshpunia15/status/1364...,[https://t.co/LXi7d92wwf],0,0,...,0,1364506120497360896,hi,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,{'url': 'https://twitter.com/RajeshPunia15/sta...,
6,https://twitter.com/Varinde17354019/status/136...,2021-02-24T09:22:54+00:00,#FarmersProtest,#FarmersProtest,1364506076272496640,"{'username': 'Varinde17354019', 'displayname':...",[],[],0,0,...,0,1364506076272496640,und,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
7,https://twitter.com/BitnamSingh/status/1364505...,2021-02-24T09:22:35+00:00,#ModiDontSellFarmers\n#FarmersProtest https://...,#ModiDontSellFarmers\n#FarmersProtest twitter....,1364505995859423234,"{'username': 'BitnamSingh', 'displayname': 'Bi...",[https://twitter.com/jagjitvaheguru/status/136...,[https://t.co/uGQb1O5Jg9],0,0,...,0,1364505995859423234,und,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,{'url': 'https://twitter.com/jagjitvaheguru/st...,
8,https://twitter.com/anmoldhaliwal/status/13645...,2021-02-24T09:22:34+00:00,@mandeeppunia1 watch full video here https://t...,@mandeeppunia1 watch full video here youtu.be/...,1364505991887347714,"{'username': 'anmoldhaliwal', 'displayname': '...",[https://youtu.be/-bUKumwq-J8],[https://t.co/wBPNdJdB0n],0,0,...,0,1364428985074032646,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,[{'thumbnailUrl': 'https://pbs.twimg.com/ext_t...,,,"[{'username': 'mandeeppunia1', 'displayname': ..."
9,https://twitter.com/SatThiara/status/136450589...,2021-02-24T09:22:11+00:00,#FarmersProtest https://t.co/ehd5FBSZGx,#FarmersProtest twitter.com/borisjohnson/s…,1364505896576053248,"{'username': 'SatThiara', 'displayname': 'Sat ...",[https://twitter.com/borisjohnson/status/13642...,[https://t.co/ehd5FBSZGx],0,0,...,0,1364505896576053248,und,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,{'url': 'https://twitter.com/BorisJohnson/stat...,


In [5]:
df.tail(10)

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,quoteCount,conversationId,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers
117397,https://twitter.com/rupindr79/status/136004022...,2021-02-12T01:37:13+00:00,Now Farmers Agitation is no longer confined to...,Now Farmers Agitation is no longer confined to...,1360040229265022979,"{'username': 'rupindr79', 'displayname': 'ਰੁ ਪ...",[],[],0,31,...,4,1360040229265022979,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
117398,https://twitter.com/bali_mandeep/status/136004...,2021-02-12T01:37:12+00:00,Kisan Ekta Zindabaad ✊\n#FarmersProtest,Kisan Ekta Zindabaad ✊\n#FarmersProtest,1360040222986178563,"{'username': 'bali_mandeep', 'displayname': 'M...",[],[],0,0,...,0,1360040222986178563,hi,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
117399,https://twitter.com/Satveer22950341/status/136...,2021-02-12T01:37:06+00:00,#FarmersProtest \n\n#MahapanchayatRevolution,#FarmersProtest \n\n#MahapanchayatRevolution,1360040199493799936,"{'username': 'Satveer22950341', 'displayname':...",[],[],0,0,...,0,1360040199493799936,und,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,,
117400,https://twitter.com/PushpSamra/status/13600401...,2021-02-12T01:37:05+00:00,The first Mahapanchayat of Punjab. The revolut...,The first Mahapanchayat of Punjab. The revolut...,1360040195786067969,"{'username': 'PushpSamra', 'displayname': 'Pus...",[],[],0,43,...,3,1360040195786067969,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
117401,https://twitter.com/lovehazran1/status/1360040...,2021-02-12T01:37:04+00:00,#BJPGovtDictatingTwitter #MahapanchayatRevolut...,#BJPGovtDictatingTwitter #MahapanchayatRevolut...,1360040192090857475,"{'username': 'lovehazran1', 'displayname': '#F...",[],[],0,1,...,0,1360040192090857475,pa,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,
117402,https://twitter.com/rickyrickstir/status/13600...,2021-02-12T01:37:02+00:00,#FarmersProtest #KisanAndolan #KisaanMajdoorEk...,#FarmersProtest #KisanAndolan #KisaanMajdoorEk...,1360040182771163138,"{'username': 'rickyrickstir', 'displayname': '...",[],[],0,0,...,0,1360040182771163138,und,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,
117403,https://twitter.com/PunjabTak/status/136004014...,2021-02-12T01:36:53+00:00,PM मोदी की अपील के बीच संयुक्त किसान मोर्चा का...,PM मोदी की अपील के बीच संयुक्त किसान मोर्चा का...,1360040146402373637,"{'username': 'PunjabTak', 'displayname': 'Punj...",[https://youtu.be/aG3qHGwoYag],[https://t.co/AzZNOGI8BX],0,0,...,0,1360040146402373637,hi,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
117404,https://twitter.com/ish_kayy/status/1360040134...,2021-02-12T01:36:50+00:00,United we stand.\nDivided we fall\n#Mahapancha...,United we stand.\nDivided we fall\n#Mahapancha...,1360040134230556678,"{'username': 'ish_kayy', 'displayname': 'ishy'...",[],[],0,65,...,6,1360040134230556678,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
117405,https://twitter.com/TV9Bharatvarsh/status/1360...,2021-02-12T01:36:49+00:00,"सिंघु बॉर्डर पर लंबी लड़ाई की तैयारी, किसानों ...","सिंघु बॉर्डर पर लंबी लड़ाई की तैयारी, किसानों ...",1360040127679000577,"{'username': 'TV9Bharatvarsh', 'displayname': ...",[https://www.tv9hindi.com/india/farmers-protes...,[https://t.co/bkjh7WXc0w],0,1,...,1,1360040127679000577,hi,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,,
117406,https://twitter.com/SikhVibes/status/136004012...,2021-02-12T01:36:49+00:00,"@Kisanektamorcha We are with you, keep the mor...","@Kisanektamorcha We are with you, keep the mor...",1360040127146430470,"{'username': 'SikhVibes', 'displayname': 'Sikh...",[],[],2,19,...,2,1360038291471388672,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'Kisanektamorcha', 'displayname'..."


In [6]:
df.isna().sum()

url                     0
date                    0
content                 0
renderedContent         0
id                      0
user                    0
outlinks                0
tcooutlinks             0
replyCount              0
retweetCount            0
likeCount               0
quoteCount              0
conversationId          0
lang                    0
source                  0
sourceUrl             912
sourceLabel           912
media               89298
retweetedTweet     117407
quotedTweet         75971
mentionedUsers      79373
dtype: int64

We can see that `media`, `retweetedTweet`, `quotedTweet` and `mentionedUsers` contain a large number of `None` values; `retweetedTweet` is empty in all 117407 rows. `sourceUrl` and `sourceLabel` are also missing in 912 rows. As we answer the questions, we'll evaluate whether these gaps affect us or can simply be ignored.
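
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up frame (the `sample` data is an assumption, not taken from the tweet file) showing how `isna().mean()` gives the missing fraction per column and how fully empty columns can be listed.

```python
import pandas as pd

# Made-up frame; None marks a missing value, as in the tweet data.
sample = pd.DataFrame({
    "content": ["a", "b", "c", "d"],
    "media": [None, None, None, {"type": "photo"}],
    "retweetedTweet": [None, None, None, None],
})

# Fraction of missing rows per column (mean of a boolean mask).
missing_share = sample.isna().mean()

# Columns that carry no information at all.
all_missing = missing_share[missing_share == 1.0].index.tolist()

print(missing_share)
print(all_missing)
```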

---

# Question 1

**The top 10 dates with the most tweets. Mention the user (username) with the most posts on each of those days. It must include the following functions**

```python
Returns: 
[(datetime.date(1999, 11, 15), "LATAM321"), (datetime.date(1999, 7, 15), "LATAM_CHI"), ...]
```


Let's take the columns we're interested in and work from there. For this question we'll isolate the `date` and `user` columns in a separate dataframe so we don't alter the original data.

In [7]:
df_q1 = df[["date", "user"]]
df_q1.head(5)

Unnamed: 0,date,user
0,2021-02-24T09:23:35+00:00,"{'username': 'ArjunSinghPanam', 'displayname':..."
1,2021-02-24T09:23:32+00:00,"{'username': 'PrdeepNain', 'displayname': 'Pra..."
2,2021-02-24T09:23:22+00:00,"{'username': 'parmarmaninder', 'displayname': ..."
3,2021-02-24T09:23:16+00:00,"{'username': 'anmoldhaliwal', 'displayname': '..."
4,2021-02-24T09:23:10+00:00,"{'username': 'KotiaPreet', 'displayname': 'Pre..."


Comparing the inputs to the expected output, it's clear that the values in the dataframe need to be processed first.

We start with the `date` column. The value at index 0 is `2021-02-24T09:23:35+00:00`, which follows the general pattern `YYYY-MM-DDTHH:MM:SS+HH:MM`, a timestamp with its UTC offset. Since the expected output only contains the day, we need to convert each value in the `date` column into a `datetime.date` object.

In [8]:
# assign() returns a new dataframe, so the slice taken from df is never modified in place
df_q1 = df_q1.assign(date=df_q1["date"].map(lambda s: datetime.strptime(s, "%Y-%m-%dT%H:%M:%S%z").date()))
df_q1.head(5)


Unnamed: 0,date,user
0,2021-02-24,"{'username': 'ArjunSinghPanam', 'displayname':..."
1,2021-02-24,"{'username': 'PrdeepNain', 'displayname': 'Pra..."
2,2021-02-24,"{'username': 'parmarmaninder', 'displayname': ..."
3,2021-02-24,"{'username': 'anmoldhaliwal', 'displayname': '..."
4,2021-02-24,"{'username': 'KotiaPreet', 'displayname': 'Pre..."
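
The format string can be checked on a single value before it is applied to the whole column. This is a small sketch using only the standard library; the `raw` string is the value shown at index 0, and `datetime.fromisoformat` is included as an alternative that needs no format string.

```python
from datetime import date, datetime

raw = "2021-02-24T09:23:35+00:00"  # value copied from index 0

# %z consumes the "+00:00" offset, so the result is timezone-aware.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")
day = parsed.date()  # calendar day used for grouping

# fromisoformat reads the same layout without a format string.
same = datetime.fromisoformat(raw)

print(parsed, day, same == parsed)
```

Note that `.date()` returns the day in the timestamp's own offset, so the grouping below is by UTC day for this data.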


Next, we have to process the `user` column to keep only the information we are interested in: the value of the `username` key.

In [9]:
# Pull the username out of each user dict; assign() again avoids writing into a slice
df_q1 = df_q1.assign(user=df_q1["user"].map(lambda u: u["username"]))
df_q1.head(5)


Unnamed: 0,date,user
0,2021-02-24,ArjunSinghPanam
1,2021-02-24,PrdeepNain
2,2021-02-24,parmarmaninder
3,2021-02-24,anmoldhaliwal
4,2021-02-24,KotiaPreet
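
As an alternative to extracting the key by hand, pandas can flatten nested dictionaries while building the dataframe. The sketch below uses `pd.json_normalize` on two made-up records (the names `alice` and `bob` are placeholders); it is shown for comparison and is not the approach used in this notebook.

```python
import pandas as pd

# Made-up records with the same nesting as the tweet objects.
records = [
    {"date": "2021-02-24T09:23:35+00:00", "user": {"username": "alice", "displayname": "Alice"}},
    {"date": "2021-02-24T09:23:32+00:00", "user": {"username": "bob", "displayname": "Bob"}},
]

# Nested keys become dotted column names such as "user.username".
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat["user.username"].tolist())
```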


Now, let's first count the total number of tweets made on each date. Then we'll sort the results and take the top 10.

In [10]:
grouped_date = df_q1.groupby("date", as_index=False)
top_dates = grouped_date.count().sort_values(by=["user"], ascending=False).head(10)
top_dates

Unnamed: 0,date,user
0,2021-02-12,12347
1,2021-02-13,11296
5,2021-02-17,11087
4,2021-02-16,10443
2,2021-02-14,10249
6,2021-02-18,9625
3,2021-02-15,9197
8,2021-02-20,8502
11,2021-02-23,8417
7,2021-02-19,8204


With the above results, we know the list of dates that the answer must contain.
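
The count-sort-take-top-N step can also be sketched without pandas, which may help when checking the numbers independently. The example assumes a plain list of `datetime.date` values; the `days` list is made up for illustration.

```python
from collections import Counter
from datetime import date

# Made-up list of tweet days: three on the 12th, two on the 13th, one on the 14th.
days = [date(2021, 2, 12)] * 3 + [date(2021, 2, 13)] * 2 + [date(2021, 2, 14)]

tweets_per_day = Counter(days)

# most_common(n) returns the n largest counts, already sorted in descending order.
top_days = tweets_per_day.most_common(2)
print(top_days)
```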

Following the same idea, we'll group by the `date` and `user` columns together.

In [11]:
grouped_user = df_q1.groupby(["date", "user"], as_index=False).size().sort_values(by="size", ascending=False)
grouped_user.head(10)

Unnamed: 0,date,user,size
35219,2021-02-19,Preetm91,267
33193,2021-02-18,neetuanjle_nitu,195
26577,2021-02-17,RaaJVinderkaur,185
7536,2021-02-13,MaanDee08215437,178
2740,2021-02-12,RanbirS00614606,176
42691,2021-02-21,Surrypuria,161
33396,2021-02-18,rebelpacifist,153
34733,2021-02-19,KaurDosanjh1979,138
48696,2021-02-23,Surrypuria,135
18540,2021-02-15,jot__b,134


We've got to be careful now. This might look like the final answer, but what the output really shows is the (date, user) pairs with the most tweets across all dates, which is different from what we are looking for.

So now we have to filter the dates based on the results shown in the `top_dates` dataframe, sort the results, and then drop the duplicated dates so that only the top user for each date remains.

In [12]:
dates = top_dates.date.to_list()
user_counts = df_q1.groupby(["date", "user"], as_index=False).size()
top_users = (
    user_counts.set_index("date").loc[dates]
    .sort_values(by="size", ascending=False)
    .reset_index().drop_duplicates(subset=["date"])
)
top_users

Unnamed: 0,date,user,size
0,2021-02-19,Preetm91,267
1,2021-02-18,neetuanjle_nitu,195
2,2021-02-17,RaaJVinderkaur,185
3,2021-02-13,MaanDee08215437,178
4,2021-02-12,RanbirS00614606,176
7,2021-02-23,Surrypuria,135
8,2021-02-15,jot__b,134
9,2021-02-16,jot__b,133
12,2021-02-14,rebelpacifist,119
23,2021-02-20,MangalJ23056160,108


We can see that the two previous results differ slightly. If you look closely, you'll notice that `2021-02-21` appears in the first result but not in the second, because it is not one of the top 10 dates.
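
Another way to pick the top user per date, shown here only as a sketch, is to filter with `isin` and then select the row with the largest `size` in each group using `idxmax`. The `counts` frame and the `d1`/`d2`/`d3` labels are made up; on ties, `idxmax` keeps the first matching row, which may differ from the sort-then-drop-duplicates approach.

```python
import pandas as pd

# Made-up (date, user, size) counts in the same shape as the grouped dataframe.
counts = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2", "d3"],
    "user": ["a", "b", "a", "c", "b"],
    "size": [3, 5, 2, 7, 9],
})
wanted_dates = ["d1", "d2"]  # stands in for the top dates

# Keep only the wanted dates, then take the row with the largest size per date.
filtered = counts[counts["date"].isin(wanted_dates)]
best_rows = filtered.loc[filtered.groupby("date")["size"].idxmax()]

print(best_rows)
```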

Next, we'll sort the answer from the most to the least tweeted date.

In [13]:
top_dates_dict = top_dates.set_index("date").to_dict()

q1_result_df = top_users.copy()
q1_result_df["tweets"] = [top_dates_dict["user"][d] for d in top_users.date]
q1_result_df = q1_result_df.sort_values(by="tweets", ascending=False).reset_index(drop=True)
q1_result_df

Unnamed: 0,date,user,size,tweets
0,2021-02-12,RanbirS00614606,176,12347
1,2021-02-13,MaanDee08215437,178,11296
2,2021-02-17,RaaJVinderkaur,185,11087
3,2021-02-16,jot__b,133,10443
4,2021-02-14,rebelpacifist,119,10249
5,2021-02-18,neetuanjle_nitu,195,9625
6,2021-02-15,jot__b,134,9197
7,2021-02-20,MangalJ23056160,108,8502
8,2021-02-23,Surrypuria,135,8417
9,2021-02-19,Preetm91,267,8204
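
The per-date totals can also be attached with a join instead of a dictionary lookup. This is a hedged sketch on made-up frames that mirror the column names used above (in `top_dates`, the `user` column holds the tweet count); it is an alternative, not the notebook's own step.

```python
import pandas as pd

# Made-up frames mirroring the shapes used above.
top_dates = pd.DataFrame({"date": ["d1", "d2"], "user": [30, 50]})  # "user" holds the count
top_users = pd.DataFrame({"date": ["d2", "d1"], "user": ["c", "b"], "size": [7, 5]})

# Rename the count column first so it does not clash with the username column.
totals = top_dates.rename(columns={"user": "tweets"})

result = (
    top_users.merge(totals, on="date", how="left")
    .sort_values(by="tweets", ascending=False)
    .reset_index(drop=True)
)
print(result)
```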


Lastly, we can turn the answer into the required form.

In [14]:
# Pair each date with its top user, preserving the order of q1_result_df
q1_result = list(zip(q1_result_df["date"], q1_result_df["user"]))
q1_result

[(datetime.date(2021, 2, 12), 'RanbirS00614606'),
 (datetime.date(2021, 2, 13), 'MaanDee08215437'),
 (datetime.date(2021, 2, 17), 'RaaJVinderkaur'),
 (datetime.date(2021, 2, 16), 'jot__b'),
 (datetime.date(2021, 2, 14), 'rebelpacifist'),
 (datetime.date(2021, 2, 18), 'neetuanjle_nitu'),
 (datetime.date(2021, 2, 15), 'jot__b'),
 (datetime.date(2021, 2, 20), 'MangalJ23056160'),
 (datetime.date(2021, 2, 23), 'Surrypuria'),
 (datetime.date(2021, 2, 19), 'Preetm91')]
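
The steps of this question can be combined into a single function that never holds the full tweet objects in memory. The sketch below is an assumption-laden alternative, not the `q1_memory` function profiled in the next section: `q1_streaming` and its `top_n` parameter are hypothetical names, the file is assumed to be UTF-8 with one JSON object per line, and ties are broken by first occurrence, which may differ from the pandas version.

```python
import json
from collections import Counter, defaultdict
from datetime import date, datetime
from typing import Dict, List, Tuple


def q1_streaming(file_path: str, top_n: int = 10) -> List[Tuple[date, str]]:
    """Hypothetical helper: the top_n busiest dates, each with its most active username."""
    tweets_per_date: Counter = Counter()
    users_per_date: Dict[date, Counter] = defaultdict(Counter)

    # Assumption: UTF-8 file with one JSON object per line.
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            day = datetime.strptime(tweet["date"], "%Y-%m-%dT%H:%M:%S%z").date()
            tweets_per_date[day] += 1
            users_per_date[day][tweet["user"]["username"]] += 1

    # Dates come out sorted by total tweets; each is paired with its top user.
    return [
        (day, users_per_date[day].most_common(1)[0][0])
        for day, _ in tweets_per_date.most_common(top_n)
    ]
```

Only two small counters grow while the file is read, so peak memory depends on the number of distinct dates and users rather than on the number of tweets.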

## Q1 Memory Usage

In [15]:
!python -m memory_profiler q1_memory.py "farmers-protest-tweets-2021-2-4.json"

Filename: q1_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     8     76.9 MiB     76.9 MiB           1   @profile
     9                                         def q1_memory(file_path: str) -> List[Tuple[datetime.date, str]]:
    10                                         
    11                                             # Get data from json file
    12   1204.7 MiB      0.0 MiB           2       with open(file_path, 'r') as file:
    13     76.9 MiB      0.0 MiB           1           data_list = []
    14   1204.7 MiB   -977.5 MiB      117408           for line in file:
    15   1204.7 MiB     99.0 MiB      117407               object_json = json.loads(line)
    16   1204.7 MiB  -1007.7 MiB      117407               data_list.append(object_json)
    17   1204.7 MiB      0.0 MiB           1       del object_json
    18                                         
    19                                             # Transform it in a dataframe
    20   1283.9 