# Week 3 - Practical

<hr style="border: 5px solid #61223b;" />

## Section 1: Recap on Pandas Basics
If you need clarifications about the Pandas API you can type the function name followed by ? to get inline documentation:

In [1]:
import pandas as pd
pd.DataFrame?

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series ob

Other resources for Pandas:
* Pandas API reference: [https://pandas.pydata.org/docs/reference/index.html](https://pandas.pydata.org/docs/reference/index.html)
* DataFrame API: [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
* Series API: [https://pandas.pydata.org/docs/reference/api/pandas.Series.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)
* GroupBy API: [https://pandas.pydata.org/docs/reference/groupby.html](https://pandas.pydata.org/docs/reference/groupby.html)

Small DataFrames can simply be printed to the console. Large DataFrames are impractical to print in full, so Pandas provides higher-level commands to inspect their contents. To get information on the schema of a DataFrame, we can use the `info()` function:

In [2]:
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    }, index=range(5)
    )
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int64
 1   B       5 non-null      int64
dtypes: int64(2)
memory usage: 208.0 bytes


For the first part of this practical we will perform some basic data analysis on the World Cup soccer logs from the first practical. However, this time the input data has been sampled and formatted as a csv file.
* When passing custom column labels to the `names` argument, [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) assumes the csv file does not contain column header data.
* The values passed to `na_values` are all treated as missing values in the dataset. `na_values` accepts scalars, lists, and dictionaries (for specifying per-column missing-value markers). In this case only the `-` character indicates missing values, and it does so for all columns in our data.

In [5]:
#! cd data && tar -xf wc_day6_1_sample.tar.bz2
column_labels = ['ClientID', 'Date', 'Time', 'URL', 'ResponseCode', 'Size']
log_df = pd.read_csv("./data/wc_day6_1_sample.csv", names=column_labels, na_values=['-'], encoding="unicode_escape")
log_df.head()

Unnamed: 0,ClientID,Date,Time,URL,ResponseCode,Size
0,1044,30/Apr/1998,22:46:12,/images/11104.gif,200.0,508.0
1,10871,01/May/1998,12:10:53,/images/ligne.gif,200.0,169.0
2,11012,01/May/1998,12:17:30,/english/individuals/player111503.htm,200.0,7027.0
3,11435,01/May/1998,13:15:13,/french/frntpage.htm,304.0,0.0
4,12128,01/May/1998,13:30:21,/english/images/nav_sitemap_off.gif,304.0,
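
To make the two `read_csv()` arguments above concrete, here is a minimal sketch on a small in-memory file. The three log lines, the `toy_df` name, and the choice to restrict `-` to the `Size` column are assumptions made for illustration; they are not part of the practical's dataset.

```python
import io
import pandas as pd

# Hypothetical three-line log excerpt without a header row; "-" marks missing values.
raw = io.StringIO(
    "1044,/images/11104.gif,200,508\n"
    "12128,/english/images/nav_sitemap_off.gif,304,-\n"
    "15006,-,200,1557\n"
)

# Column labels are supplied via `names`, so the first line is treated as data.
labels = ["ClientID", "URL", "ResponseCode", "Size"]

# A dict restricts the missing-value marker to specific columns:
# here "-" is only converted to NaN in the Size column.
toy_df = pd.read_csv(raw, names=labels, na_values={"Size": ["-"]})

print(toy_df)
print(toy_df.isna().sum())
```

Passing a plain list such as `na_values=['-']`, as in the cell above, would instead treat `-` as missing in every column.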


<hr style="border: 1px solid #b79962;" />

### Recap of indexing operators
*(If you were at the lecture and paid attention, skip this recap section.)*

| Operator | Allowed inputs | Examples | Multi-axis selection? | 
|:---|:---|:---|:---|
|`.loc[]`| <ul><li> single label</li><li> list or array of labels </li><li> a slice object with labels (stop-inclusive!)</li><li>boolean array</li><li>callable function with one argument</li></ul>|<ul><li> `.loc[3]` or `.loc["a"]`</li><li> `.loc[["a", "b", "c"]]` </li><li> `.loc["a":"c"]`</li><li>`.loc[[True, False, True]]`</li><li>`.loc[lambda s: s > 0]`</li></ul> | Yes: `.loc[1, "2"]` (first argument = rows, second argument = columns) |
|`.iloc[]`|<ul><li> single integer</li><li> list or array of integers </li><li> a slice object with ints (right-exclusive!)</li><li>boolean array</li><li>callable function with one argument</li></ul>|<ul><li> `.iloc[2]`</li><li> `.iloc[[0, 1, 2]]` </li><li> `.iloc[0:2]`</li><li>`.iloc[[True, False, True]]`</li><li>`.iloc[lambda s: [0, 2]]`</li></ul>| Yes: `.iloc[1, 2]` (first argument = rows, second argument = columns) |
|`[]`| <ul><li> single label</li><li> list or array of labels </li><li> a slice object (stop-inclusive for labels, stop-exclusive for integers!)</li><li>boolean array</li><li>callable function with one argument</li></ul>|<ul><li> `df[3]` or `df["a"]`</li><li> `df[["a", "b", "c"]]` </li><li> `df["a":"c"]`</li><li>`df[[True, False, True]]`</li><li>`df[lambda s: s > 0]`</li></ul> | No: <ul><li>Single value = match column labels</li><li>List or array of labels = match column labels; boolean array = select rows</li><li>Slice = match rows</li></ul> |
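
The differences in the table are easiest to see on a frame whose row labels are not integers. The following is a small sketch; the `demo` frame and all variable names are made up for this illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
demo = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": [1.5, 2.5, 3.5, 4.5]},
    index=["a", "b", "c", "d"],
)

by_label = demo.loc["a":"c"]     # label slice: includes the stop label "c"
by_position = demo.iloc[0:2]     # integer slice: excludes the stop position 2
cell = demo.loc["b", "y"]        # multi-axis: row label, then column label
same_cell = demo.iloc[1, 1]      # multi-axis: row position, then column position
column = demo["x"]               # [] with a single label selects a column
rows = demo["a":"b"]             # [] with a slice selects rows

print(len(by_label), len(by_position), cell, same_cell)
```

When in doubt, `.loc[]` and `.iloc[]` are the safer choices, because `[]` changes its meaning depending on the type of the argument.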


<hr style="border: 1px solid #b79962;" />

### SQL-like queries
A SQL statement typically selects a subset of rows from a table that match a given criterion. This is known as the [Selection operator in Relational Algebra](https://en.wikipedia.org/wiki/Selection_(relational_algebra)). Similarly, we can perform selections in Pandas using boolean indexing. Boolean indexing is a technique where a list of boolean values (one per row) is used to filter a DataFrame.

For example, let's say we only want entries from `01/May/1998`. To do this we can create a boolean array like:

In [25]:
is_may1st = log_df['Date'] == '01/May/1998'
is_may1st.head(2)

0    False
1     True
Name: Date, dtype: bool

Now we can filter our DataFrame by passing it the boolean array.

In [26]:
may1_df = log_df[is_may1st]
may1_df.head()

Unnamed: 0,ClientID,Date,Time,URL,ResponseCode,Size
1,10871,01/May/1998,12:10:53,/images/ligne.gif,200.0,169.0
2,11012,01/May/1998,12:17:30,/english/individuals/player111503.htm,200.0,7027.0
3,11435,01/May/1998,13:15:13,/french/frntpage.htm,304.0,0.0
4,12128,01/May/1998,13:30:21,/english/images/nav_sitemap_off.gif,304.0,
5,13649,01/May/1998,14:55:01,/images/hm_anime_e.gif,200.0,15609.0


Or we can directly do this by passing in the boolean clause to the DataFrame:

In [32]:
may1_df = log_df[log_df['Date'] == '01/May/1998']
may1_df.head()

Unnamed: 0,ClientID,Date,Time,URL,ResponseCode,Size
1,10871,01/May/1998,12:10:53,/images/ligne.gif,200.0,169.0
2,11012,01/May/1998,12:17:30,/english/individuals/player111503.htm,200.0,7027.0
3,11435,01/May/1998,13:15:13,/french/frntpage.htm,304.0,0.0
4,12128,01/May/1998,13:30:21,/english/images/nav_sitemap_off.gif,304.0,
5,13649,01/May/1998,14:55:01,/images/hm_anime_e.gif,200.0,15609.0
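
Boolean arrays can also be combined before filtering. The sketch below uses a hypothetical four-row miniature of `log_df` (the `toy_log` frame and its values are assumptions for illustration) to show the element-wise operators `&`, `~` and the `isin()` helper.

```python
import pandas as pd

# Hypothetical miniature of log_df with the same column names.
toy_log = pd.DataFrame({
    "Date": ["30/Apr/1998", "01/May/1998", "01/May/1998", "01/May/1998"],
    "ResponseCode": [200.0, 200.0, 304.0, 404.0],
    "Size": [508.0, 169.0, 0.0, None],
})

# Combine conditions element-wise with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses, because & and |
# bind more tightly than == in Python.
ok_on_may1st = toy_log[
    (toy_log["Date"] == "01/May/1998") & (toy_log["ResponseCode"] == 200)
]

# A mask can be stored in a variable and negated.
is_ok = toy_log["ResponseCode"] == 200
not_ok = toy_log[~is_ok]

# isin() tests membership in a list of values.
not_modified_or_not_found = toy_log[toy_log["ResponseCode"].isin([304.0, 404.0])]

print(len(ok_on_may1st), len(not_ok), len(not_modified_or_not_found))
```

Python's `and`, `or`, and `not` keywords do not work here, since they try to reduce a whole Series to a single truth value.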


While selection is used for filtering rows, [projection is the relational algebra operator](https://en.wikipedia.org/wiki/Projection_%28relational_algebra%29) used to select columns.
With the indexing operators we can project a DataFrame onto a subset of its columns by passing a list of column labels.
For example, to only keep the `URL` and `ResponseCode` columns we would use:

In [92]:
url_codes = log_df[['URL', 'ResponseCode']]
url_codes.head(5)

Unnamed: 0,URL,ResponseCode
0,/images/11104.gif,200.0
1,/images/ligne.gif,200.0
2,/english/individuals/player111503.htm,200.0
3,/french/frntpage.htm,304.0
4,/english/images/nav_sitemap_off.gif,304.0
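
Selection and projection can be combined in a single `.loc[]` call, with the row mask first and the column list second. This is a sketch on a hypothetical three-row frame (`toy_log` is not the practical's data).

```python
import pandas as pd

# Hypothetical miniature of log_df.
toy_log = pd.DataFrame({
    "ClientID": [1044, 10871, 11435],
    "URL": ["/images/11104.gif", "/images/ligne.gif", "/french/frntpage.htm"],
    "ResponseCode": [200.0, 200.0, 304.0],
})

# Rows where the mask is True (selection), restricted to two columns (projection).
ok_urls = toy_log.loc[toy_log["ResponseCode"] == 200, ["ClientID", "URL"]]

# A single label returns a Series; a one-element list keeps a DataFrame.
as_series = toy_log["URL"]
as_frame = toy_log[["URL"]]

print(ok_urls)
print(type(as_series).__name__, type(as_frame).__name__)
```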


Pandas also allows you to group the DataFrame by values in any column. For example to group requests by `ResponseCode` you can execute:

In [28]:
grouped = log_df.groupby('ResponseCode')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb78b8fae50>

The `groupby` method returns a `DataFrameGroupBy` or `SeriesGroupBy` object based on whether it was called on a `DataFrame` or `Series` object. These datatypes contain a number of groups where each group is a dataframe or series.

In [31]:
print("Number of groups: ", grouped.ngroups)
print("Group keys: ", grouped.groups.keys())
print("Sample of group 200:\n", grouped.get_group(200).head())

Number of groups:  7
Group keys:  dict_keys([200.0, 206.0, 302.0, 304.0, 400.0, 404.0, 500.0])
Sample of group 200:
    ClientID         Date      Time                                    URL  \
0      1044  30/Apr/1998  22:46:12                      /images/11104.gif   
1     10871  01/May/1998  12:10:53                      /images/ligne.gif   
2     11012  01/May/1998  12:17:30  /english/individuals/player111503.htm   
5     13649  01/May/1998  14:55:01                 /images/hm_anime_e.gif   
6     15006  01/May/1998  16:14:32  /english/images/comp_bu_group_off.gif   

   ResponseCode     Size  
0         200.0    508.0  
1         200.0    169.0  
2         200.0   7027.0  
5         200.0  15609.0  
6         200.0   1557.0  


Pandas also has many useful commands that can be used on groupby objects following the [split-apply-combine paradigm](https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html).
This very common paradigm consists of three steps:
1. **Splitting** the data into groups based on some criteria.
2. **Applying** a function to each group independently.
3. **Combining** the results into a data structure.

Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the **apply step**, we might wish to do one of the following:
* **Aggregation:** compute a summary statistic (or statistics) for each group. For example, sums or means, or compute group sizes / counts.
* **Transformation:** perform some group-specific computations and return a like-indexed object. For example: normalise data, filling NAs within groups with a value derived from each group.
* **Filtration:** discard some groups, according to a group-wise computation that evaluates True or False. For example, discard data that belongs to groups with only a few members or filter out data based on the group sum or mean.
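
The three kinds of apply step listed above can be sketched as follows. The `toy_log` frame and its values are assumptions chosen so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical miniature of log_df with three response-code groups.
toy_log = pd.DataFrame({
    "ResponseCode": [200, 200, 200, 304, 304, 404],
    "Size": [500.0, 1500.0, 1000.0, 0.0, 0.0, 300.0],
})
by_code = toy_log.groupby("ResponseCode")

# Aggregation: one summary row per group.
summary = by_code["Size"].agg(["count", "mean"])

# Transformation: the result has the same index as the input,
# here each size minus the mean size of its own group.
centred = by_code["Size"].transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups with at least two members.
frequent = by_code.filter(lambda g: len(g) >= 2)

print(summary)
print(centred)
print(frequent)
```

Note the shapes: aggregation returns one row per group, transformation returns one value per original row, and filtration returns a subset of the original rows.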

<hr style="border: 1px solid #b79962;" />

### Exercises:
1. How many rows are present in `log_df`?
2. What are the number of non-NaN values for each column in `log_df`?
3. What are the URLs in rows 85 through 90?
4. Use selection to print the number of requests that had HTTP return code `404`.
5. Use `.query()` to select all responses that had a `4xx` return code.
6. Use `.value_counts()` to show all unique client error responses (400-499) and server error responses (500-599) and their corresponding counts.
7. What is the average response size?
8. Using `groupby`, get the number of log entries per `ResponseCode` value.
9. Add the string `https://www.mywebsite.co.za/` in front of all `URL` column values.
10. How many of the various response codes occurred on the 30th April and how many on 1st May? (Hint: Use a multi_grouped DataFrame.) On which day did the servers produce the most errors?
11. How many GIFs were delivered to the clients?

<br/><br/>
<hr style="border: 5px solid #61223b;" />

## Section 2: Applying functions to rows and columns

So far we have been using SQL-style operators to process the data. However to do data cleaning or more complex analysis we often need to apply functions on rows or columns of a DataFrame.

For example, consider the columns `Date` and `Time` in `log_df`. It would be useful if we could combine these columns into a single `DateTime` column which can be used for filtering, grouping, etc.

To create the `DateTime` column we will use the Pandas helper function `to_datetime()`, which converts strings to datetime objects. To build the combined date-and-time string for every row of the DataFrame, we use the `apply()` function. Here `apply()` takes two arguments: the function to apply, and `axis`, which indicates whether the function is applied to every row (`axis=1`) or every column (`axis=0`).

In [94]:
log_df['DateTime'] = pd.to_datetime(log_df.apply(lambda row: row['Date'] + ' ' + row['Time'], axis=1))
log_df.head(5)

Unnamed: 0,ClientID,Date,Time,URL,ResponseCode,Size,DateTime
0,1044,30/Apr/1998,22:46:12,/images/11104.gif,200.0,508.0,1998-04-30 22:46:12
1,10871,01/May/1998,12:10:53,/images/ligne.gif,200.0,169.0,1998-05-01 12:10:53
2,11012,01/May/1998,12:17:30,/english/individuals/player111503.htm,200.0,7027.0,1998-05-01 12:17:30
3,11435,01/May/1998,13:15:13,/french/frntpage.htm,304.0,0.0,1998-05-01 13:15:13
4,12128,01/May/1998,13:30:21,/english/images/nav_sitemap_off.gif,304.0,,1998-05-01 13:30:21


This might take a minute to execute due to datetime parsing. This is an opportunity to stretch or take a look at Pandas' [Timestamp documentation](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html) or [Python's `datetime.datetime`](https://docs.python.org/3/library/datetime.html#datetime.datetime) object which are basically equivalent.
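
If the wait becomes a problem, a vectorised variant is worth trying. The sketch below is an alternative to the row-wise `apply()`, not the approach used in the cell above: it concatenates whole columns and passes an explicit `format` so that Pandas does not have to infer it. The two-row `toy_log` frame is an assumption for illustration, and `%b` expects English month abbreviations.

```python
import pandas as pd

# Hypothetical miniature of log_df with the same Date and Time layout.
toy_log = pd.DataFrame({
    "Date": ["30/Apr/1998", "01/May/1998"],
    "Time": ["22:46:12", "12:10:53"],
})

# String concatenation works on whole columns, so no row-wise apply is needed.
combined = toy_log["Date"] + " " + toy_log["Time"]

# An explicit format avoids guessing the layout of the strings.
toy_log["DateTime"] = pd.to_datetime(combined, format="%d/%b/%Y %H:%M:%S")

print(toy_log)
```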

If we want to group by hour, we can now use the timestamp functionality instead of doing any string parsing:

In [95]:
hour_grouped = log_df.groupby(lambda row: log_df['DateTime'][row].hour)
hour_grouped.ngroups

24

Finally, note that you can apply operations on each group using the `apply()` method. This is similar to the apply on the DataFrame we saw earlier except the `apply()` method is now called once per group.
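
As a small sketch of a per-group `apply()`, the example below groups a hypothetical four-row frame by hour, using the `.dt` accessor as the grouping key, and computes the spread of response sizes within each hour. The `toy_log` frame, the `Hour` name, and the chosen statistic are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of log_df after the DateTime column was added.
toy_log = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "1998-04-30 22:46:12", "1998-05-01 12:10:53",
        "1998-05-01 12:17:30", "1998-05-01 13:15:13",
    ]),
    "Size": [508.0, 169.0, 7027.0, 0.0],
})

# Group on the hour extracted through the .dt accessor.
hour_key = toy_log["DateTime"].dt.hour.rename("Hour")
by_hour = toy_log.groupby(hour_key)

# apply() calls the function once per group; here it receives the group's
# Size values as a Series and returns the largest minus the smallest.
size_range = by_hour["Size"].apply(lambda sizes: sizes.max() - sizes.min())

print(by_hour.ngroups)
print(size_range)
```

Selecting the `Size` column before calling `apply()` keeps the function's input small and makes it explicit which values the function operates on.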

#### Exercises:
1. Create a new column that contains the `ResponseSize` in kilobytes converted from the byte value in the `Size` column.
2. What is the average file size for images (.gif or .jpg or .jpeg files) which had response code 200?
3. What is the standard deviation?
4. Using a regular expression, extract the file name and type from the URL column. Add these as two new columns called `Filename` and `Filetype` to the `log_df` dataframe. For the sake of this exercise assume that for the URL `https://www.mywebsite.co.za/images/11104.gif`, `11104` would be the file name and `.gif` would be the file type.
5. Generate a histogram of traffic to the site every half-hour and plot this. [Use `plotly.express` for plotting the histogram](https://plotly.com/python/histograms/).
6. Is there any correlation between client IDs and the hours of the day at which they visit the website? Get 100 random client IDs from the dataset and draw a scatter plot that shows the hours of the day at which these clients sent requests. [Use `plotly.express` for plotting the scatter plot](https://plotly.com/python/line-and-scatter/).
7. **(Optional)** Use the logs from another day (`./data/wc_day91_1_log.tar.bz2`) and merge them with the data from `../week1/data/wc_day6_1_log.tar.bz2`. Repeat Exercise 5 from this section with the merged data. How similar or different are the results? (Hint: First use UNIX command-line tools to get a csv file and then load it into Pandas.)

<br/><br/>
<hr style="border: 5px solid #61223b;" />

## Section 3: Groupby and data merging

For the exercises in this section use the following files:
```
./data/mentors.csv
./data/marks.csv
./data/advanced_tutors.csv
```

For your code to match the memo execute the following cell in your notebook as setup for this section's exercises:

In [14]:
total_students = 215
discontinues = 8
# column labels for the mentors.csv file
field_names = ['Date', 'StartTime', 'EndTime', 'Mentor', 'Student', 'StudentID', 'Attended'] 

### Exercises:

1. Load the mentor data into Pandas and perform some basic EDA to understand the data. For example, what are the unique values for each column? Does the data contain missing values? Does the data contain incorrect values?
2. How many unique mentor sessions were attended?
3. How many unique students attended at least one mentor session?
4. Assuming there were a total of 215 students taking the course, what are the percentages of all students (215) that attended more than one, three, and five mentor sessions? (Use the `groupby` and `query` functions.)
5. Create a bar chart of the number of students and their session attendance counts. In other words, Y-axis = number of students and X-axis = session counts. Use the following to create the plot.
```
fig = px.bar(<your data>,
                labels={<your column name of the data to plot>:'Number of Sessions Attended', 'value':"Nr. of Students"},
                text_auto=True,
                title="Students per session attendance frequency")
fig.update_layout(bargap=0.2)
fig.show()
```
6. How many sessions did each mentor hold?
7. Use `pd.to_datetime()` and `Series.dt.strftime()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html)) to calculate the number of students that made bookings per week. Plot a bar graph that looks like the following:
![Bargraph](./images/practical_sec3_question7.png)
8. How many sessions occurred per week throughout the semester?
9. What was the average mentor session attendance per week throughout the semester?
10. What is the session size per week (for both noon and evening sessions)? Try to recreate the following figure:
![grouped bar chart](./images/practical_sec3_question10.png)
11. Merge the mentor session data with the advanced mentor session data (`./data/advanced_tutors.csv`) such that you will still be able to differentiate between the type of sessions.
12. What is the overlap between the students that attended the normal and advanced mentor sessions?
13. Merge the `./data/marks.csv` file with your current dataframe.
14. Is there a significant difference in the marks between students that went to normal mentor sessions and advanced mentor sessions?
15. Add a categorical column `Category` to your dataframe with categories `["Fail", "Cont.", "Pass", "Distinction"]` for groups based on the students' marks: `[0-39, 40-49, 50-74, 75-100]`. Recreate the following pie chart:
![pie chart](./images/practical_sec3_question15.png)

```
# Helper code to create the Pie chart
fig = px.pie(<your merged dataframe>, names='Category', title="Student Performance Overall",
             color="Category",
             color_discrete_map={
                "Distinction": "blue",
                "Pass": "green",
                "Cont.": "orange",
                "Fail": "red"})
fig.update_traces(textinfo='percent+label')
fig.show()
```


## References and further reading:
* [Pandas documentation on indexing and selecting data](https://pandas.pydata.org/docs/user_guide/indexing.html#)
* [Pandas documentation on advanced data selection and hierarchical indexing (MultiIndex)](https://pandas.pydata.org/docs/user_guide/advanced.html#)
* [Plotly](https://plotly.com/python/)
* [Plotly express](https://plotly.com/python/plotly-express/)