# Chapter II - Pandas (Wrangling)

In this notebook we will cover further essential skills in data manipulation (also called wrangling):

- Transforming dataframe columns using `.apply()` and `.map()` functions
- Renaming columns
- Grouping entries using `.groupby()` and aggregating them using `.agg()`

In [17]:
import pandas as pd

## Motivation

For this we will use a sample of 50,000 tweets from UK museums from my thesis... me here being Ellen :)

In [18]:
df = pd.read_csv('../data/sample_museum_tweets.tsv', sep='\t')
df

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0
...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0


## Section (1): Accessing manipulations

### A) Setting indexes

We can give the dataframe a bit more structure:
- the `id` column can be transformed into the dataframe's index, thus enabling us e.g. to select a tweet by id.

This is done using the `set_index()` method.
- Two arguments are important: `drop` (which specifies whether the column is removed from the dataframe once it has become the index) and `inplace` (which modifies the dataframe _directly_ instead of returning a new one)


In [19]:
df.set_index('id', drop=True, inplace=True)
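To see what `drop` and `inplace` change, here is a minimal sketch on a made-up three-row dataframe (the name `toy` and its values are assumptions for illustration, not part of the museum dataset):

```python
import pandas as pd

# Toy data standing in for the tweets dataframe (values are made up).
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "museum_account": ["museum_a", "museum_b", "museum_a"],
})

# Without inplace=True, set_index returns a new dataframe and leaves `toy` untouched.
indexed_drop = toy.set_index("id", drop=True)    # `id` is now only the index
indexed_keep = toy.set_index("id", drop=False)   # `id` is both the index and a column

print(list(indexed_drop.columns))  # ['museum_account']
print(list(indexed_keep.columns))  # ['id', 'museum_account']
print(list(toy.columns))           # ['id', 'museum_account'], unchanged
```

Assigning the result to a variable, as done here, is an alternative to `inplace=True` and makes it easier to keep the original dataframe around.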

We can now access a row by passing its identifier to the `loc` indexer, as we saw last time:

In [20]:
df.loc[7789471450859]

ts                                            2019-06-29 07:42:00+00:00
museum_account                                             hull_museums
original_text         Our Hands on History Museum is open today 12no...
is_retweet                                                        False
is_reply                                                           True
interactions_count                                                  1.0
Name: 7789471450859, dtype: object

❓ [Question]
- What data type does this return?
- What if you want to access more than one tweet?

In [21]:
# your answer here:

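The index can be reset using the `reset_index()` method.
- Again the `drop` and `inplace` arguments are relevant. Here, `drop` specifies whether the old index is discarded (`True`) or kept as a regular column (`False`).
- We can also choose the name we want to give to the column created from the index using the `names` argument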

In [22]:
df.reset_index(drop=False, inplace=True, names='id')
df

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0
...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0
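The effect of `drop` and `names` in `reset_index()` can be sketched on a small made-up dataframe (`toy` and its values are assumptions for illustration; the `names` argument needs a reasonably recent pandas, 1.5 or later):

```python
import pandas as pd

# Toy dataframe whose index is already called `id` (values are made up).
toy = pd.DataFrame(
    {"museum_account": ["museum_a", "museum_b"]},
    index=pd.Index([101, 102], name="id"),
)

kept = toy.reset_index(drop=False)            # the index comes back as an `id` column
dropped = toy.reset_index(drop=True)          # the index values are thrown away
renamed = toy.reset_index(names="tweet_id")   # the new column gets another name

print(list(kept.columns))     # ['id', 'museum_account']
print(list(dropped.columns))  # ['museum_account']
print(list(renamed.columns))  # ['tweet_id', 'museum_account']
```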


✏️ [Ex.1] 
- ✏️ Display the first 10 elements of the dataframe.
- ✏️ Using functions seen in the previous notebook, convert the `ts` column into a `datetime` value.
- ✏️ Again, using functions we have already seen, create the columns `year`, `month`, `day`, and `hour` that record when the tweet was published.

In [23]:
# your solution here:

### B) Setting column names

An operation on dataframes that you'll find yourself doing very often is to rename the columns. The first way of renaming columns is to directly overwrite the dataframe's column index via the `columns` attribute.
We can change the column names by assigning to `columns` a list containing the new column names.

**NB**: the length of the list and the number of columns must match!

In [24]:
df.columns

Index(['id', 'ts', 'museum_account', 'original_text', 'is_retweet', 'is_reply',
       'interactions_count'],
      dtype='object')

In [25]:
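# NB: this assumes that the `year`, `month`, `day` and `hour` columns from Ex.1 exist (11 columns in total)
df.columns = ['id', 'ts', 'museum_account', 'tweet_text', 'is_retweet', 'is_reply',
       'interactions_count', 'year', 'month', 'day', 'hour']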

ValueError: Length mismatch: Expected axis has 7 elements, new values have 11 elements

In [26]:
# let's check whether the change took place
df.head()

Unnamed: 0,id,ts,museum_account,original_text,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0


The second way of renaming columns is to use the `rename()` method of a dataframe. The `columns` parameter takes a dictionary that maps old column names to new ones.

```python
mapping_dict = {
    "old_column1_name": "new_column1_name",
    "old_column2_name": "new_column2_name",
}
```

In [None]:
# Let's shorten the column name: `tweet_text` => `tweet`
df = df.rename(columns={"tweet_text": "tweet"})
df

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0
...,...,...,...,...,...,...,...
49995,6273812294446,2022-01-05 15:51:00+00:00,dtcswansea,#OnThisDay 1934 Dylan: ‘my evenings are given ...,False,False,5.0
49996,2076749205202,2021-10-11 16:31:47+00:00,museumcromwell,@EdeBorrett @joinLordGrey @17thCenturyLady @Ki...,False,True,4.0
49997,8445227010972,2019-10-02 12:02:10+00:00,hullmaritime,@HullTruck Thank you so much for your continue...,False,True,4.0
49998,8904441389462,2019-03-16 02:26:52+00:00,museumcromwell,@MuseumCromwell @MirandaMalins @17thCenturyLad...,False,True,1.0


Note that the `inplace` parameter exists here too. The above cell is identical to:
```python
df.rename(columns={"tweet_text": "tweet"}, inplace=True)
```
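One thing to watch out for: by default, `rename()` silently ignores keys that do not match any column, so a typo in the old name goes unnoticed. A minimal sketch on a made-up dataframe (`toy` is an assumption for illustration), using the `errors="raise"` option of `rename()`:

```python
import pandas as pd

toy = pd.DataFrame({"original_text": ["hello", "world"]})

# The key "tweet_text" does not exist, so nothing is renamed and no error is raised.
unchanged = toy.rename(columns={"tweet_text": "tweet"})
print(list(unchanged.columns))  # ['original_text']

# With errors="raise", the same call fails loudly instead.
try:
    toy.rename(columns={"tweet_text": "tweet"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```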

❓ [Question]
- In which cases is it more convenient to use the second method over the first?

_Your answer here:_


## Section (2): Transformation

A typical problem you will face in data wrangling is the need to transform the data you have been given into another form. If that transformation is regular enough, you may want to write a function that performs the conversion.
You can then apply that function to your `pandas` data.

The two main methods used to manipulate and transform values in a dataframe are:
- `map()`: an element-wise method for simple conversions, applied to _one_ column
- `apply()`: suited for more complex operations, which can be applied to a _whole row_.

In this section we'll be using both to enrich our datasets with useful information (useful for exploration, for later visualizations, etc.).

The structure is the following:
- the `map()` or `apply()` methods are applied to a `pd.Series` or `pd.DataFrame`
- they return a `pd.Series` which you will typically want to use to create a new column

Typically:

```python
df['NewColumn'] = df['OldColumn'].apply(some_function)
```
which is equivalent to
```python
df['NewColumn'] = df['OldColumn'].map(some_function)
```
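The two calls give the same result when a function is passed to a single column, but the methods differ elsewhere. As a rough sketch on made-up data (`toy`, `tweet_type`, `summary` and `describe` are names assumed for illustration): `map()` also accepts a dictionary of replacements, while `apply()` with `axis=1` on a dataframe hands each whole row to the function.

```python
import pandas as pd

toy = pd.DataFrame({
    "museum_account": ["museum_a", "museum_b"],
    "is_reply": [True, False],
    "interactions_count": [2.0, 5.0],
})

# map() also accepts a dictionary: each value is looked up and replaced.
toy["tweet_type"] = toy["is_reply"].map({True: "reply", False: "original"})

# apply() on the whole dataframe with axis=1 passes each row to the function,
# so several columns can be combined.
def describe(row):
    return f"{row['museum_account']}: {row['tweet_type']} ({row['interactions_count']:.0f})"

toy["summary"] = toy.apply(describe, axis=1)
print(toy["summary"].tolist())
# ['museum_a: reply (2)', 'museum_b: original (5)']
```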

For example, say we want to extract the length of the tweet:
- (1) We can take the `tweet` column and apply the native Python function `len()`:

    ❗ Note that you pass the function itself, **WITHOUT** parentheses or arguments: just `len`

In [28]:
df['tweet_length'] = df['tweet'].apply(len)
df.head()

Unnamed: 0,id,ts,museum_account,tweet,is_retweet,is_reply,interactions_count,tweet_length
0,5798944681378,2021-07-22 19:38:34+00:00,ancienthousemus,"@AncientHouseMus @artukdotorg Ugh, no wonder h...",False,True,0.0,61
1,8702741370361,2019-01-02 08:25:06+00:00,livcathedral,Looking for a fun challenge to take as a famil...,False,False,4.0,246
2,4436751452541,2020-04-06 10:07:47+00:00,samuseums,@samuseums A great choice to show the beautifu...,False,True,2.0,118
3,7789471450859,2019-06-29 07:42:00+00:00,hull_museums,Our Hands on History Museum is open today 12no...,False,True,1.0,144
4,1508119645064,2022-05-18 11:12:56+00:00,ecmfcm,@CynthiaTheBaker @BarnsleyMuseums @SMTrust @Da...,False,True,1.0,128


- (2) which is equivalent to:

In [None]:
df['tweet_length'] = df['tweet'].map(len)
df.head()

Here, we used an existing function. But we could have used one we wrote ourselves, which can be useful to deal with exceptions, rare cases, errors, etc.

- (3) By defining a function explicitly:

In [None]:
def extractTweetLength(tweet_text):
    # return the number of characters in the tweet
    return len(tweet_text)


df['tweet_length'] = df['tweet'].map(extractTweetLength)
df.head()

- (4) Or by using an anonymous `lambda` function:

In [None]:
df['tweet_length'] = df['tweet'].map(lambda x: len(x))
df.head()
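As an illustration of why a custom function can help with rare cases, here is a minimal sketch on made-up data that contains a missing tweet (`toy` and `safe_length` are hypothetical names, and returning 0 for a missing tweet is just one possible choice):

```python
import pandas as pd

toy = pd.DataFrame({"tweet": ["Open today!", None, "See you @museum_a"]})

def safe_length(text):
    # Missing tweets are not strings, so len() would raise a TypeError.
    if not isinstance(text, str):
        return 0
    return len(text)

toy["tweet_length"] = toy["tweet"].map(safe_length)
print(toy["tweet_length"].tolist())  # [11, 0, 17]

# pandas also offers a vectorised string accessor, which keeps missing values as missing.
toy["tweet_length_str"] = toy["tweet"].str.len()
```

Whether a missing tweet should count as length 0 or stay missing depends on what you want to compute afterwards (a mean, for instance, is affected by the zeros).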

✏️ [Ex.2] 
To see this in action, use `apply()` or `map()` to create new columns called:
- ✏️ `tweet_link`, knowing that the default link is: `https://x.com/i/web/status/` + tweet_id
- ✏️ `tweet_nbwords`, which counts the number of words in the tweet
- ✏️ `tweet_mentions`, which lists the mentions (i.e. words that start with @) separated by commas

    For this last question, be careful about things that are not mentions, as in this sample tweet:
    ```
    "As @stephanie__27 mentionned, our last version (v3@main) is out ! Reach out to steph@org.org for more"
    ```

    There are several ways you could tackle this problem. Feel free to use any!

- ✏️ `tweet_nbmentions`, which counts the number of mentions
- ❗ Export the created dataframe into a `.csv` that we will reuse later. 

In [None]:
# Your solution here:



## Section (3): Aggregation

<img src='../data/png.png' width='600px'>

### A) Grouping

To group a `DataFrame`, one uses the `.groupby()` method:

`df.groupby('columnName')`

❗ Important: The object returned by `groupby` is a `DataFrameGroupBy`, **not** a normal `DataFrame`:

In [None]:
grouped = df.groupby('year')
grouped
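To get a feel for what such an object holds, here is a small sketch on made-up data (`toy` and its values are assumptions for illustration): the grouped object can be iterated over, and `size()` counts the rows of each group.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "tweet_length": [120, 80, 200],
})

grouped = toy.groupby("year")

# Iterating over the object yields one (key, sub-dataframe) pair per group.
for year, group in grouped:
    print(year, len(group))

# size() counts the rows in each group and returns a Series.
sizes = grouped.size()
print(sizes.to_dict())  # {2019: 2, 2020: 1}
```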

❗ To make it usable, we need to specify _what_ to do with each group of entries.

This is the point of **aggregation**.


### B) Aggregating


- `agg` is used to pass an aggregation function to be applied to each group resulting from `groupby`.


For example, if we want to count how many tweets there are by year, we can pass the string `'count'` as the argument:



In [None]:
grouped.agg('count')

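- Note that this is a bit redundant: `count` returns the number of non-missing values in every column, so (unless a column has missing values) the same number is repeated across all columns, regardless of the tweet length / mentions / etc.
- Furthermore, this is similar to the `.value_counts()` method we have already seen: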

In [None]:
df.value_counts('year')

- This is because we need to _tune_ which operation we want for each column.
- Some will benefit from counting, some from averaging, some from summing, etc.


The way we specify this is by using a dictionary and passing it as the argument of the `.agg()` method.
This has the double advantage of:
- tuning the aggregation function to each column;
- removing all unnecessary columns.


In [None]:
df.groupby('year').agg({'id':'count',
                       'tweet_length':'mean'})

- To make this more readable and keep track of what the values are, it is recommended to `.rename()` your columns.

Recalling what we have seen earlier, this is how you can do it:


In [None]:
df.groupby('year').agg({'id':'count',
                       'tweet_length':'mean'}).rename(columns={'id':'nb_tweets',
                                                       'tweet_length':'tweet_length_mean'}
                       )
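As an alternative to chaining `.rename()`, pandas also supports "named aggregation", where each keyword argument of `.agg()` is the new column name and its value is a `(column, function)` pair. A minimal sketch on made-up data (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2019, 2019, 2020],
    "tweet_length": [120, 80, 200],
})

# Each keyword is the output column name; the tuple is (input column, function).
summary = toy.groupby("year").agg(
    nb_tweets=("id", "count"),
    tweet_length_mean=("tweet_length", "mean"),
)
print(summary)
```

Both styles give the same table; which one reads better is largely a matter of taste.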

❓ Now that you have aggregated the columns using some function, what data type do you end up with?

_Your answer here:_

---

❗ The pre-existing functions you can use in the aggregation are:

- `count`: Number of non-NA values
- `sum`: Sum of non-NA values
- `mean`: Mean of non-NA values
- `median`: Arithmetic median of non-NA values
- `std`, `var`: standard deviation and variance
- `min`, `max`: Minimum and maximum of non-NA values


❗ Just like with `apply()` and `map()`, you can use any function you define:

```python
df.groupby('groupingColumn').agg({'columnName':some_function})
```
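For instance, a custom function receives the values of one group as a `pd.Series` and should return a single value. Here is a sketch on made-up data (`toy` and `value_range` are hypothetical names chosen for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "tweet_length": [120, 80, 200, 150],
})

def value_range(values):
    # `values` is the Series holding one group's entries for the column.
    return values.max() - values.min()

ranges = toy.groupby("year").agg({"tweet_length": value_range})
print(ranges["tweet_length"].to_dict())  # {2019: 40, 2020: 50}
```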

✏️ [Ex.3] Use the aggregation method to determine, after a grouping by year:
- ✏️ The average number of mentions
- ✏️ The total number of tweeted characters
- ✏️ The standard deviation of tweeting hours
- ✏️ (Difficult) The number of tweets that have more than 20 interactions.
- ✏️ Change the aggregated column names to keep track of what was done.


In [None]:
# Your solution here:

### C) Multiple grouping and aggregation

As a final remark, note that we can do the **grouping** and the **aggregation** on multiple conditions.

Say for example that I want to group all tweets that were posted in the same year _and_ the same month.
- Simply pass the two columns as a list to the `groupby()` function:

In [None]:
df.groupby(['year', 'month']).agg({'id':'count',
                       'tweet_length':'mean'}).rename(columns={'id':'nb_tweets',
                                                       'tweet_length':'tweet_length_mean'}
                       )
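When grouping on several columns, the rows of the result are identified by a combination of keys (a `MultiIndex`). A minimal sketch on made-up data (`toy` and its values are assumptions for illustration) shows how to select one combination and how to turn the keys back into ordinary columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "year": [2019, 2019, 2020, 2020],
    "month": [1, 1, 1, 2],
})

counts = toy.groupby(["year", "month"]).agg({"id": "count"})

# Each row is identified by a (year, month) pair.
print(counts.index.tolist())        # [(2019, 1), (2020, 1), (2020, 2)]
print(counts.loc[(2019, 1), "id"])  # 2

# reset_index() turns both levels back into ordinary columns.
flat = counts.reset_index()
print(list(flat.columns))  # ['year', 'month', 'id']
```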

- You can also apply several aggregation functions to the same column by passing them as a list:

In [None]:
df.groupby('year').agg(
    {'interactions_count':[
            'count',
            'mean',
            'min',
            'max',
            'std',
            'var'
        ]
    }
)
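Passing a list of functions produces two-level column names, one level for the original column and one for the function. As a sketch on made-up data (`toy` and its values are assumptions for illustration), a single statistic can be selected with a tuple, or the two levels can be joined into flat names:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "interactions_count": [1.0, 3.0, 4.0],
})

stats = toy.groupby("year").agg({"interactions_count": ["count", "mean"]})
print(stats.columns.tolist())
# [('interactions_count', 'count'), ('interactions_count', 'mean')]

# Select one statistic with a tuple...
means = stats[("interactions_count", "mean")]

# ...or flatten the two levels into single names.
stats.columns = ["_".join(col) for col in stats.columns]
print(stats.columns.tolist())
# ['interactions_count_count', 'interactions_count_mean']
```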

✏️ [Ex.4] Expanding what was done in [Ex.3], calculate for each month of 2020:
- ✏️ The mean number of mentions
- ✏️ The total number of mentions
- ✏️ The mean number of tweet interactions
- ✏️ The standard deviation of the number of tweet interactions

In [None]:
# Your solution here: