# Chapter II - Pandas (Wrangling)

In this notebook we will cover further essential skills in data manipulation (also called wrangling):

- Transforming dataframe columns using `.apply()` and `.map()` functions
- Renaming columns
- Grouping entries using `.groupby()` and aggregating them using `.agg()`

In [1]:
import pandas as pd

## Motivation

For this we will use a sample of 50,000 tweets from UK museums from my thesis... me here being Ellen :)

In [23]:
df = pd.read_csv('../data/sample_museum_tweets.tsv', sep='\t')
df

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0
...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0


## Section (1): Accessing manipulations

### A) Setting indexes

We can give the dataframe a bit more structure:
- the `id` column can be transformed into the dataframe's index, thus enabling us e.g. to select a tweet by id.

This is done using the `set_index()` method.
- Two arguments are important: `drop` (which specifies whether the column used as the new index is removed from the dataframe's regular columns) and `inplace` (which modifies the dataframe _directly_ instead of returning a new one)
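
As a minimal sketch of what these two arguments do, here is a toy dataframe (the values are made up and only stand in for the tweets table):

```python
import pandas as pd

# Toy data standing in for the tweets table (values are made up).
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "museum_account": ["hull_museums", "livcathedral", "samuseums"],
})

# drop=True (the default): the `id` column moves into the index.
indexed = toy.set_index("id")

# drop=False: `id` becomes the index AND stays as a regular column.
indexed_keep = toy.set_index("id", drop=False)

print(list(indexed.columns))
print(list(indexed_keep.columns))

# Without inplace=True, `toy` itself is left unchanged.
print(list(toy.columns))

# Label-based lookup by id now works on the indexed copy.
print(indexed.loc[102, "museum_account"])
```

Reassigning the result (`toy = toy.set_index("id")`) has the same effect as `inplace=True` and is often easier to follow.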


In [11]:
df.set_index('id', drop=False)

Unnamed: 0_level_0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
5798944681378,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
8702741370361,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
4436751452541,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
7789471450859,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
1508119645064,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0
...,...,...,...,...,...,...,...
6273812294446,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0
2076749205202,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0
8445227010972,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0
8904441389462,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0


In [12]:
df.set_index('id', drop=True, inplace=True)

In [30]:
df.head()

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0


We can now access rows by their index label using the `loc` indexer, as we saw last time:

In [36]:
df.loc[1:3]

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0


❓ [Question]
- What data type does this return?
- What if you want to access more than one tweet?

In [21]:
# your answer here:

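The index can be reset using the `reset_index()` method.
- Again the `drop` and `inplace` arguments are relevant. Here, `drop` specifies whether the old index is discarded (`True`) or kept as a column (`False`).
- We can also choose the name given to the column that receives the old index, using the `names` argument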
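
A small sketch of these options on a made-up dataframe (note that, as far as I know, `names` is only available from pandas 1.5 onwards):

```python
import pandas as pd

# Toy dataframe whose index is named `id` (values are made up).
toy = pd.DataFrame(
    {"museum_account": ["hull_museums", "livcathedral"]},
    index=pd.Index([101, 102], name="id"),
)

# drop=False (the default): the old index comes back as a column.
restored = toy.reset_index()

# drop=True: the old index is discarded entirely.
discarded = toy.reset_index(drop=True)

# names= picks the column name used for the old index.
renamed = toy.reset_index(names="tweet_id")

print(list(restored.columns))
print(list(discarded.columns))
print(list(renamed.columns))
```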

In [56]:
df

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0
...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0


In [22]:
df.reset_index(drop=False, inplace=True, names='id')
df

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0
...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0



✏️ [Ex.1] 
- ✏️ Display the first 10 elements of the dataframe.
- ✏️ Using functions seen in the previous notebook, convert the `ts` column into a `datetime` value.
- ✏️ Again, using functions we have already seen, create the columns `year`, `month`, `day`, and `hour` that record when the tweet was published.

In [None]:
# your solution here:
df['ts'] = pd.to_datetime(df['ts'])

df['year'] = df['ts'].dt.year
df['month'] = df['ts'].dt.month_name()
df['day'] = df['ts'].dt.day
df['hour'] = df['ts'].dt.hour

### B) Setting column names

An operation on dataframes that you'll find yourself doing very often is renaming columns. The first way of renaming columns is to directly modify the dataframe's column index via the `columns` attribute.
We can change the column names by assigning to `columns` a list containing the new column names.

**NB**: the length of the list must match the number of columns!
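
To illustrate the length constraint, here is a sketch on a two-column toy dataframe (made-up data); assigning a list of the wrong length should raise a `ValueError`:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "original_text": ["a", "b"]})

# Same number of names as columns: the assignment works.
toy.columns = ["id", "tweet_text"]

# One name too few: pandas refuses the assignment.
try:
    toy.columns = ["id"]
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(list(toy.columns))
print(type(mismatch_error).__name__)
```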

In [69]:
df.columns

Index(['id', 'ts', 'museum_account', 'original_text', 'is_retweet', 'is_reply',
       'interactions_count', 'year', 'month', 'day', 'hour'],
      dtype='object')

In [70]:
df.columns = ['id', 'ts', 'museum_account', 'tweet_text', 'is_retweet', 'is_reply',
       'interactions_count', 'year', 'month', 'day', 'hour']

In [71]:
# let's check that the change did take place
df.head()

Unnamed: 0,id,ts,museum_account,tweet_text,is_retweet,is_reply,interactions_count,year,month,day,hour
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11


The second way of renaming columns is to use the `rename()` method of a dataframe. Its `columns` parameter takes a dictionary mapping old column names to new ones.

```python
mapping_dict = {
    "old_column1_name": "new_column1_name",
    "old_column2_name": "new_column2_name",
}
```

In [76]:
# Let's rename the column once more: `tweet_text` => `tweet`
df = df.rename(columns={"tweet_text": "tweet"})
df

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,year,month,day,hour
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11
...,...,...,...,...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0,2022,January,5,15
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0,2021,October,11,16
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0,2019,October,2,12
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0,2019,March,16,2


Note that here too the `inplace` parameter exists. The above cell is equivalent to
```python
df.rename(columns={"tweet_text": "tweet"}, inplace=True)
```
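
The following sketch (toy data, hypothetical column `no_such_column`) shows two behaviours of `rename()` worth keeping in mind: it returns a new dataframe, and with default settings a key that matches no column is silently ignored.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1], "tweet_text": ["hello"], "year": [2020]})

# Only the listed column is renamed; the others keep their names.
renamed = toy.rename(columns={"tweet_text": "tweet"})

# rename() returned a new dataframe: `toy` still has the old name.
print(list(toy.columns))
print(list(renamed.columns))

# A key that matches no column is ignored by default (no error raised).
unchanged = toy.rename(columns={"no_such_column": "whatever"})
print(list(unchanged.columns))
```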

❓ [Question]
- In which cases is it more convenient to use the second method over the first?

_Your answer here:_


## Section (2): Transformation

A typical problem you will face in data wrangling is the need to transform the data you have been given into another form. If that transformation is regular enough, you may want to write a function that performs the conversion.
You can then apply that function to your `pandas` data.

The two main methods used to manipulate and transform values in a dataframe are:
- `map()`: an element-wise method for simple conversions, applied to _one_ column
- `apply()`: suited for more complex operations, which can be applied to a _whole row_.

In this section we'll be using both to enrich our datasets with useful information (useful for exploration, for later visualizations, etc.).

The structure is the following:
- the `map()` or `apply()` methods are applied to a `pd.Series` or `pd.DataFrame`
- they return a `pd.Series` which you will typically want to use to create a new column

Typically:

```python
df['NewColumn'] = df['OldColumn'].apply(some_function)
```

which, for a single column and a plain function, gives the same result as
```python
df['NewColumn'] = df['OldColumn'].map(some_function)
```
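
The two methods are not interchangeable in every situation. As a rough sketch on made-up data (the helper `describeRow` is hypothetical): `map()` on a column also accepts a dictionary used as a lookup table, while `apply()` on a whole dataframe with `axis=1` hands each row to the function.

```python
import pandas as pd

toy = pd.DataFrame({
    "museum_account": ["hull_museums", "livcathedral"],
    "is_reply": [True, False],
    "interactions_count": [1.0, 4.0],
})

# Series.map with a dictionary: each value is looked up in the mapping.
toy["reply_label"] = toy["is_reply"].map({True: "reply", False: "original"})

# DataFrame.apply with axis=1: the function receives one row at a time,
# so it can combine several columns.
def describeRow(row):
    return f"{row['museum_account']}: {row['interactions_count']:.0f} interactions"

toy["summary"] = toy.apply(describeRow, axis=1)
print(toy[["reply_label", "summary"]])
```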

For example, say we want to extract the length of the tweet:
- (1) We can take the `tweet` column and apply Python's built-in `len()` function:

    ❗ Note that you pass the function itself, **WITHOUT** parentheses or arguments: just `len`

In [90]:
df['tweet_length'] = df['tweet'].apply(len)
df.head()

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,year,month,day,hour,tweet_length
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19,61
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8,246
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10,118
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7,144
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11,128


- (2) which is equivalent to:

In [92]:
df['tweet_length'] = df['tweet'].map(len)
df.head()

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,year,month,day,hour,tweet_length
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19,61
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8,246
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10,118
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7,144
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11,128


Here, we used an existing function. But we could also have used one we wrote ourselves, which can be useful to deal with exceptions, rare cases, errors, etc.

- (3) By defining a function explicitly:

In [95]:
def extractTweetLength(tweet_text):
    return len(tweet_text)


df['tweet_length'] = df['tweet'].map(extractTweetLength)
df.head()

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,year,month,day,hour,tweet_length
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19,61
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8,246
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10,118
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7,144
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11,128


- (4) Or by using an anonymous `lambda` function:

In [96]:
df['tweet_length'] = df['tweet'].map(lambda x: len(x))
df.head()

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,year,month,day,hour,tweet_length
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19,61
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8,246
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10,118
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7,144
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11,128
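
As an example of why a hand-written function can help with rare cases, here is a sketch that assumes some tweets might be missing (the sample above may not contain any); `safeTweetLength` is a hypothetical helper, and returning 0 for missing text is just one possible choice.

```python
import pandas as pd

toy = pd.DataFrame({"tweet": ["Open today!", None, "Hello @HullTruck"]})

def safeTweetLength(tweet_text):
    # Missing or non-text values get a length of 0 instead of raising a TypeError.
    if not isinstance(tweet_text, str):
        return 0
    return len(tweet_text)

toy["tweet_length"] = toy["tweet"].map(safeTweetLength)
print(toy)
```

Calling plain `len` on the missing value would fail, which is the kind of case a custom function lets you handle explicitly.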


✏️ [Ex.2] 
To see this in action, use `apply()` or `map()` to create new columns called:
- ✏️ `tweet_link`, knowing that the default link is: `https://x.com/i/web/status/` + tweet_id
- ✏️ `tweet_nbwords`, which counts the number of words in the tweet
- ✏️ `tweet_mentions`, which lists the mentions (i.e. words that start with @)

    For this last question, be careful about things that are not mentions, as in this sample tweet:
    ```
    "As @stephanie__27 mentionned, our last version (v3@main) is out ! Reach out to steph@org.org for more"
    ```

    There are several ways you could tackle this problem. Feel free to use any!

- ✏️ `tweet_nbmentions`, which counts the number of mentions
- ❗ Export the created dataframe into a `.csv` that we will reuse later. 

In [None]:
# Your solution here:
def createsTweetLink(tweet_id):
    return f"https://x.com/i/web/status/{tweet_id}"

df['tweet_link'] = df['id'].apply(createsTweetLink)

def getNumberWords(tweet_text):
    return len(tweet_text.split())

df['tweet_nbwords'] = df['tweet'].apply(getNumberWords)

def extractMentions(tweet_text):
    list_of_words = tweet_text.split()
    mentions = []
    for word in list_of_words:
        if word.startswith('@'):
            mentions.append(word)
    return mentions

df['tweet_mentions'] = df['tweet'].apply(extractMentions)

df['tweet_nbmentions'] = df['tweet_mentions'].apply(len)
df


Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,year,month,day,hour,tweet_length,tweet_link,tweet_nbwords,tweet_mentions,tweet_nbmentions
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19,61,https://x.com/i/web/status/5798944681378,9,"[@AncientHouseMus, @artukdotorg]",2
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8,246,https://x.com/i/web/status/8702741370361,36,[],0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10,118,https://x.com/i/web/status/4436751452541,21,[@samuseums],1
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7,144,https://x.com/i/web/status/7789471450859,24,[],0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11,128,https://x.com/i/web/status/1508119645064,15,"[@CynthiaTheBaker, @BarnsleyMuseums, @SMTrust,...",4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0,2022,January,5,15,100,https://x.com/i/web/status/6273812294446,15,[],0
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0,2021,October,11,16,218,https://x.com/i/web/status/2076749205202,22,"[@EdeBorrett, @joinLordGrey, @17thCenturyLady,...",11
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0,2019,October,2,12,67,https://x.com/i/web/status/8445227010972,11,[@HullTruck],1
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0,2019,March,16,2,200,https://x.com/i/web/status/8904441389462,23,"[@MuseumCromwell, @MirandaMalins, @17thCentury...",9
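
The splitting approach above keeps any punctuation attached to a mention (e.g. `@HullTruck!`). One possible alternative, sketched here with a regular expression, is to match an `@` followed by word characters only when the `@` is not preceded by a word character. The pattern and the helper name `extractMentionsRegex` are my own assumptions, not part of the original solution.

```python
import re
import pandas as pd

# "@" followed by word characters, but only when the "@" is not itself
# preceded by a word character (this rules out "v3@main" and e-mail addresses).
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")

def extractMentionsRegex(tweet_text):
    return MENTION_PATTERN.findall(tweet_text)

sample = ("As @stephanie__27 mentionned, our last version (v3@main) is out ! "
          "Reach out to steph@org.org for more")
toy = pd.DataFrame({"tweet": [sample, "Thank you @HullTruck!"]})

toy["tweet_mentions"] = toy["tweet"].apply(extractMentionsRegex)
toy["tweet_nbmentions"] = toy["tweet_mentions"].apply(len)
print(toy[["tweet_mentions", "tweet_nbmentions"]])
```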


## Section (3): Aggregation

<img src='../data/png.png' width='600px'>

### A) Grouping

To group a `DataFrame`, one uses the `.groupby()` method:

`df.groupby('columnName')`

❗ Important: The object returned by `groupby` is a `DataFrameGroupBy` **not** a normal `DataFrame`:

In [145]:
len(df.loc[(df['year']==2019)&(df['month']=='April')])

1410

In [143]:
grouped = df.groupby(['year','month']).count()
grouped.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,day,hour,tweet_length,tweet_link,tweet_nbwords,tweet_mentions,tweet_nbmentions
year,month,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2019,April,1410,1410,1410,1410,1410,1410,1410,1410,1410,1410,1410,1410,1410,1410
2019,August,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224
2019,December,920,920,920,920,920,920,920,920,920,920,920,920,920,920
2019,February,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224,1224
2019,January,1227,1227,1227,1227,1227,1227,1227,1227,1227,1227,1227,1227,1227,1227


In [None]:
df.groupby('year').count()
df.groupby('year').agg('count')

Unnamed: 0_level_0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,month,day,hour,tweet_length,tweet_link,tweet_nbwords,tweet_mentions,tweet_nbmentions
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2019,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931
2020,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065
2021,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925
2022,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079


❗ To make it usable, we need to specify _what_ to do with each group of entries.

This is the point of **aggregation**.


In [152]:
df.groupby('year', as_index=False)['interactions_count']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x3146c81d0>
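
To make the nature of these grouped objects more concrete, here is a small sketch on made-up data: the result of `groupby` only describes the groups, and an aggregation is what turns them back into a regular pandas object.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "interactions_count": [1.0, 4.0, 2.0],
})

grouped = toy.groupby("year")
print(type(grouped).__name__)
print(type(grouped["interactions_count"]).__name__)

# Iterating yields one (key, sub-dataframe) pair per group.
group_sizes = {year: len(rows) for year, rows in grouped}
print(group_sizes)

# An aggregation produces a regular Series indexed by the group keys.
totals = grouped["interactions_count"].sum()
print(totals)
```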

### B) Aggregating


- `agg` is used to pass an aggregation function to be applied to each group resulting from `groupby`.


For example, if we want to count how many tweets there are by year, we can pass `'count'` as the argument:



In [153]:
grouped = df.groupby('year')

In [None]:
grouped.agg('count')

Unnamed: 0_level_0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,month,day,hour,tweet_length,tweet_link,tweet_nbwords,tweet_mentions,tweet_nbmentions
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2019,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931,14931
2020,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065,16065
2021,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925,13925
2022,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079,5079


- Note that this is a bit redundant: every column shows the same number of tweets, regardless of the tweet length / mentions / etc.
- Furthermore, this is similar to the `.value_counts()` method we have already seen:

In [None]:
df.value_counts('year')
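
A quick sketch (toy data) of this similarity: counting values in a column and taking the size of each group give the same numbers, only the ordering differs by default.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2019, 2020, 2019, 2021, 2020, 2019]})

# value_counts sorts by frequency by default; sort_index orders by year.
counts_a = toy["year"].value_counts().sort_index()

# size() counts the rows in each group, ordered by the group key.
counts_b = toy.groupby("year").size()

print(counts_a.to_dict())
print(counts_b.to_dict())
```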

- To get something more informative, we need to _tune_ which operation we want for each column.
- Some will benefit from counting, some from averaging, some from summing, etc.


The way we specify this is by using a dictionary and passing it as the argument of the `.agg()` method.
This has the double advantage of:
- tuning the aggregation function to each column;
- removing all unnecessary columns.


In [163]:
df.groupby(['year']).agg({'id':'count',
                       'tweet_length':'mean'})

Unnamed: 0_level_0,id,tweet_length
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2019,14931,155.407407
2020,16065,155.545721
2021,13925,160.064417
2022,5079,163.778303


- To make this more readable and keep track of what the values are, it is recommended to `.rename()` your columns.

Recalling what we have seen earlier, this is how you can do it:


In [177]:
df.groupby('year').agg({'id':'count','tweet_length':'mean'}).rename(columns={'id':'nb_tweets','tweet_length':'tweet_length_mean'})

Unnamed: 0_level_0,nb_tweets,tweet_length_mean
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2019,14931,155.407407
2020,16065,155.545721
2021,13925,160.064417
2022,5079,163.778303
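
As an alternative to chaining `.agg()` and `.rename()`, pandas also supports so-called named aggregation, where each keyword is the output column name and each value is a `(column, function)` pair. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "year": [2019, 2019, 2020, 2020],
    "tweet_length": [100, 200, 50, 70],
})

# new_column_name=(source_column, aggregation_function)
summary = toy.groupby("year").agg(
    nb_tweets=("id", "count"),
    tweet_length_mean=("tweet_length", "mean"),
)
print(summary)
```

Which of the two styles reads better is largely a matter of taste; the dictionary plus `.rename()` version used in this notebook gives the same table.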


❓ Now that you have aggregated the columns using some function, what data type do you end up with?

_Your answer here:_

---

❗ The pre-existing functions you can use in the aggregation are:

- `count`: Number of non-NA values
- `sum`: Sum of non-NA values
- `mean`: Mean of non-NA values
- `median`: Arithmetic median of non-NA values
- `std`, `var`: standard deviation and variance
- `min`, `max`: Minimum and maximum of non-NA values


❗ Just like with `apply()` and `map()`, you can use any function you define:

```python
df.groupby('groupingColumn').agg({'columnName':some_function})
```
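
For instance, a function you define receives the values of one group as a `pd.Series` and should return a single value. A minimal sketch with a hypothetical `hourRange` helper on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "hour": [8, 19, 10, 12],
})

def hourRange(hours):
    # `hours` is the Series of values belonging to one group.
    return hours.max() - hours.min()

spread = toy.groupby("year").agg({"hour": hourRange})
print(spread)
```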

✏️ [Ex.3] Use the aggregation method to determine, after a grouping by year:
- ✏️ The average number of mentions
- ✏️ The total number of tweeted characters
- ✏️ The standard deviation of tweeting hours
- ✏️ (Difficult) The number of tweets that have more than 20 interactions.
- ✏️ Change the aggregated column names to keep track of what was done.


In [183]:
df.head()

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,year,month,day,hour,tweet_length,tweet_link,tweet_nbwords,tweet_mentions,tweet_nbmentions
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,2021,July,22,19,61,https://x.com/i/web/status/5798944681378,9,"[@AncientHouseMus, @artukdotorg]",2
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,2019,January,2,8,246,https://x.com/i/web/status/8702741370361,36,[],0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,2020,April,6,10,118,https://x.com/i/web/status/4436751452541,21,[@samuseums],1
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,2019,June,29,7,144,https://x.com/i/web/status/7789471450859,24,[],0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,2022,May,18,11,128,https://x.com/i/web/status/1508119645064,15,"[@CynthiaTheBaker, @BarnsleyMuseums, @SMTrust,...",4


In [195]:
sum(df['interactions_count']>20)

5831

In [None]:
# Your solution here:

def countMoreTwenty(input_series):
    return sum(input_series>20)

df.groupby('year').agg({'tweet_nbmentions':'mean',
                        'tweet_length':'sum',
                        'hour':'std',
                        'interactions_count':countMoreTwenty}).rename(columns={'tweet_nbmentions':'tweet_avg_mentions',
                                                                       'tweet_length':'tweet_sum_text',
                                                                       'hour':'hour_std',
                                                                       'interactions_count':'moretwenty_interactions'})


# Same aggregation, written with an anonymous `lambda` instead of a named function:
df.groupby('year').agg({'tweet_nbmentions':'mean',
                        'tweet_length':'sum',
                        'hour':'std',
                        'interactions_count':lambda x: sum(x>20)}).rename(columns={'tweet_nbmentions':'tweet_avg_mentions',
                                                                       'tweet_length':'tweet_sum_text',
                                                                       'hour':'hour_std',
                                                                       'interactions_count':'moretwenty_interactions'})

Unnamed: 0_level_0,tweet_avg_mentions,tweet_sum_text,hour_std,moretwenty_interactions
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019,1.336548,2320388,4.349074,1534
2020,1.350202,2498842,4.163337,2040
2021,1.45465,2228897,4.186734,1697
2022,1.417799,831830,4.282999,560


### C) Multiple grouping and aggregation

As a final remark, note that we can do the **grouping** and the **aggregation** on multiple conditions.

Say for example that I want to group all tweets that were posted in the same year _and_ the same month.
- Simply pass the two columns as a list to the `groupby()` function:

In [205]:
df.groupby(['year', 'month']).agg({'tweet_nbmentions':'mean',
                        'tweet_length':'sum',
                        'hour':'std',
                        'interactions_count':lambda x: sum(x>20)}).rename(columns={'tweet_nbmentions':'tweet_avg_mentions',
                                                                       'tweet_length':'tweet_sum_text',
                                                                       'hour':'hour_std',
                                                                       'interactions_count':'moretwenty_interactions'})

Unnamed: 0_level_0,Unnamed: 1_level_0,tweet_avg_mentions,tweet_sum_text,hour_std,moretwenty_interactions
year,month,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019,April,1.268794,209668,4.325505,159
2019,August,1.276144,192699,4.391132,108
2019,December,1.115217,145364,4.388219,119
2019,February,1.361928,185432,4.41505,110
2019,January,1.358598,185381,4.326072,152
2019,July,1.425581,203160,4.222477,116
2019,June,1.462323,198014,4.359242,134
2019,March,1.33168,219925,4.374108,161
2019,May,1.349522,212614,4.343569,130
2019,November,1.418976,187461,4.31276,115


In [None]:
df.groupby(['year', 'month']).agg({'id':'count',
                       'tweet_length':'mean'}).rename(columns={'id':'nb_tweets',
                                                       'tweet_length':'tweet_length_mean'}
                       )

- You can also apply several aggregation functions to the same column by passing them as a list:

In [212]:
df.groupby('year').agg(
    {'interactions_count':['min','max','count','mean','std',countMoreTwenty]}
)

Unnamed: 0_level_0,interactions_count,interactions_count,interactions_count,interactions_count,interactions_count,interactions_count
Unnamed: 0_level_1,min,max,count,mean,std,countMoreTwenty
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2019,0.0,36595.0,14931,18.165093,330.35548,1534
2020,0.0,109160.0,16065,36.063492,1219.822236,2040
2021,0.0,18462.0,13925,17.276338,207.690155,1697
2022,0.0,20654.0,5079,22.543414,370.430061,560
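
Passing a list of functions produces two-level column names, such as `('interactions_count', 'min')`. If single-level names are easier to work with later on, one common option (sketched here on made-up data) is to join the two levels:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "interactions_count": [0.0, 4.0, 2.0],
})

stats = toy.groupby("year").agg({"interactions_count": ["min", "max"]})

# Each column label is a pair (column, function); join it into one string.
stats.columns = ["_".join(pair) for pair in stats.columns]
print(stats)
```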


✏️ [Ex.4] Expanding what was done in [Ex.3], calculate for each month of 2020:
- ✏️ The mean number of mentions
- ✏️ The total number of mentions
- ✏️ The mean number of tweet interactions
- ✏️ The standard deviation of the number of tweet interactions

In [243]:
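# Your solution here:
# NB: this cell groups every year, and 'count' returns the number of tweets per group.
# For the exercise as stated, filter on 2020 first and use 'sum' for the total number of mentions.
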
result = df.groupby(['year', 'month']).agg({
    'tweet_nbmentions':['mean', 'count'],
    'interactions_count':['mean', 'std']
})
result

Unnamed: 0_level_0,Unnamed: 1_level_0,tweet_nbmentions,tweet_nbmentions,interactions_count,interactions_count
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,count,mean,std
year,month,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
2019,April,1.268794,1410,17.348936,160.568336
2019,August,1.276144,1224,13.426471,93.313473
2019,December,1.115217,920,22.696739,167.102284
2019,February,1.361928,1224,11.191176,58.015655
2019,January,1.358598,1227,22.145069,139.310362
2019,July,1.425581,1290,23.385271,341.581822
2019,June,1.462323,1274,11.726845,36.706206
2019,March,1.33168,1411,42.523742,979.36273
2019,May,1.349522,1359,13.7844,65.704474
2019,November,1.418976,1191,13.219983,79.60569
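
The cell above covers every year and uses `'count'`, which gives the number of tweets in each group and not the total number of mentions. A sketch closer to the exercise statement, on made-up data with the same column names, would filter on 2020 first and use `'sum'`:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2019],
    "month": ["April", "April", "May", "April"],
    "tweet_nbmentions": [2, 0, 1, 5],
    "interactions_count": [1.0, 3.0, 4.0, 10.0],
})

# Keep only the 2020 tweets, then group by month.
only_2020 = toy[toy["year"] == 2020]
result_2020 = only_2020.groupby("month").agg({
    "tweet_nbmentions": ["mean", "sum"],     # 'sum' totals the mentions
    "interactions_count": ["mean", "std"],
})
print(result_2020)
```

Note that a month with a single tweet has an undefined standard deviation, so `std` returns `NaN` for it.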
