[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/databyjp/axi_da_transform_demos/blob/main/DAT2_Week_4_2.ipynb)
# DA Transform - Week 4 session 2

In [None]:
import pandas as pd

#### Pandas groupby operation

In [None]:
import seaborn as sns
df = sns.load_dataset("tips")
df.head()

In [None]:
fdf = df[df['size'] > 2]
fdf.head()

In [None]:
ffdf = fdf[fdf['smoker'] == 'Yes']
ffdf.head()

In [None]:
fdf = df[(df['size'] > 2) & (df['smoker'] == 'Yes')]
fdf.iloc[:, 3:5]

In [None]:
df.head()

In [None]:
df['day'].unique()

In [None]:
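# numeric_only=True skips the text/categorical columns (sex, smoker, time), which have no mean
gdf = df.groupby('day').mean(numeric_only=True)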
gdf.head()

In [None]:
gdf.iloc[:2, :2]

##### Multi-index groups

In [None]:
# Aggregate the numeric columns, grouped by a single column
df.groupby(['size']).median(numeric_only=True)

In [None]:
# What happens if we group this data by multiple columns?
df.groupby(['size', 'sex']).median(numeric_only=True)
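
Grouping by more than one column returns a result with a multi-level index. The sketch below shows one way to read values back out of such a result. It assumes a small hand-made DataFrame (`toy`, a hypothetical stand-in for the `tips` data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'size': [2, 2, 2, 3, 3, 3],
    'sex': ['Male', 'Female', 'Male', 'Female', 'Female', 'Male'],
    'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'server': ['a', 'b', 'a', 'b', 'a', 'b'],
})

# Grouping by two columns produces a MultiIndex of (size, sex) pairs;
# numeric_only=True leaves the text column 'server' out of the median
grouped = toy.groupby(['size', 'sex']).median(numeric_only=True)
print(grouped)

# Select a single group by passing a tuple of index values
male_size_2 = grouped.loc[(2, 'Male'), 'total_bill']
print(male_size_2)

# reset_index() turns the index levels back into ordinary columns
flat = grouped.reset_index()
print(flat)
```

Flattening with `reset_index()` is often convenient when the grouped result needs to be filtered or plotted like any other DataFrame.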

#### Joining multiple DataFrames

##### `concat`

![Concatenate](https://pandas.pydata.org/docs/_images/merging_concat_basic.png)

In [None]:
chars = ['A', 'B', 'C', 'D']
ind1 = range(4)
ind2 = range(4, 8)

list_1 = [[i + str(j) for i in chars] for j in ind1]
df1 = pd.DataFrame(list_1, columns=chars, index=ind1)

list_2 = [[i + str(j) for i in chars] for j in ind2]
df2 = pd.DataFrame(list_2, columns=chars, index=ind2)

display(df1)
display(df2)

In [None]:
# Let's concatenate these two:
pd.concat([df2, df1])
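
Because `pd.concat` stacks the frames in the order they are listed, the result above keeps the original index labels out of order. A minimal sketch of two ways to tidy that up, using hypothetical frames `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

# Rows are stacked in list order and keep their original index labels
stacked = pd.concat([bottom, top])
print(stacked.index.tolist())  # [2, 3, 0, 1]

# Option 1: sort by the existing labels
ordered = stacked.sort_index()

# Option 2: discard the old labels and renumber from 0
renumbered = pd.concat([bottom, top], ignore_index=True)
print(renumbered)
```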

##### `join`

![join](https://pandas.pydata.org/docs/_images/merging_merge_on_key.png)

##### Types of `join`

Inner / Left / Right / Full Outer

![Join types](https://raw.githubusercontent.com/amartinson193/SQL_Checkered_Flag_Join_Diagrams/main/checkered_flag_diagram_pg1.png)

Source: https://github.com/amartinson193/SQL_Checkered_Flag_Join_Diagrams




In [None]:
chars = ['A', 'B', 'C', 'D']
ind1 = [1, 2, 3, 5, 6]
ind2 = [1, 2, 4]

list_1 = [[i + str(j) for i in chars[:2]] for j in ind1]
df_l = pd.DataFrame(list_1, columns=chars[:2], index=ind1)

list_2 = [[i + str(j) for i in chars[2:]] for j in ind2]
df_r = pd.DataFrame(list_2, columns=chars[2:], index=ind2)

# list_3 = [[i + str(j) for i in ['E', 'F']] for j in ind2]
# df_3 = pd.DataFrame(list_3, columns=['E', 'F'], index=ind2)

display(df_l)
display(df_r)
# display(df_3)

In [None]:
# Let's try joining
df_r.join(df_l, how='left')
# df_b = df_a.join(df_3)
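
To compare the four join types side by side, the sketch below joins two small frames on their index with each value of `how`. The frames `left` and `right` are hypothetical and only loosely mirror `df_l` and `df_r`.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A1', 'A2', 'A3']}, index=[1, 2, 3])
right = pd.DataFrame({'C': ['C1', 'C2', 'C4']}, index=[1, 2, 4])

# Join on the index using each join type and keep the results for inspection
results = {}
for how in ['inner', 'left', 'right', 'outer']:
    results[how] = left.join(right, how=how)
    print(how, sorted(results[how].index))

# Rows with no match on the other side are filled with NaN
print(results['outer'])
```

An inner join keeps only labels 1 and 2, which appear in both frames. A left join keeps 1, 2 and 3, a right join keeps 1, 2 and 4, and an outer join keeps all four labels.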

#### `inplace=True` operations 

In [None]:
df = sns.load_dataset("tips")
df.head()

In [None]:
df_new = df.rename({'size': 'party size'}, axis=1)
df_new.head()

In [None]:
df.head()  # Notice the column names

In [None]:
df.rename({'size': 'party size'}, axis=1, inplace=True)

In [None]:
df.head()
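
The difference between the two calls can be checked directly. This is a small sketch with a hypothetical one-column frame, `parties`:

```python
import pandas as pd

parties = pd.DataFrame({'size': [2, 3]})

# Without inplace: a new DataFrame is returned and the original is untouched
renamed = parties.rename({'size': 'party size'}, axis=1)
columns_before = list(parties.columns)
print(columns_before)         # ['size']

# With inplace=True: the original is modified and the call returns None
result = parties.rename({'size': 'party size'}, axis=1, inplace=True)
print(result)                 # None
print(list(parties.columns))  # ['party size']
```

Since an `inplace=True` call returns `None`, assigning its result back to the same variable (for example `df = df.rename(..., inplace=True)`) would replace the DataFrame with `None`.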

### Data types - Statistical

- **Quantitative** (Interval & Ratio)
- **Categorical** (Nominal & Ordinal)


What type of data is each column below?

In [None]:
df.head()

### Function as an argument

In [None]:
def add_title(var):
  return f'The Honourable {var}'.title()

def make_larger(var):
  return var + 1000

def outer_function(values, func_in):  # 'values' avoids shadowing the built-in vars()
  list_out = list()
  for v in values:
    list_out.append(func_in(v))
  return list_out

names = ['peNNy', 'enRiQue', 'xAvIer']
# print(outer_function(names, add_title))

numbers = [15, 10, 200]
# print(outer_function(numbers, make_larger))

print(outer_function(numbers, str))  # Can you tell what is going on here?

df['total_bill'].apply(make_larger)
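
Any callable can be passed this way: a built-in such as `str`, a named function, or a short anonymous `lambda`. The sketch below assumes a hypothetical helper `apply_to_all`, which does the same job as `outer_function`, and a hand-made Series in place of `df['total_bill']`.

```python
import pandas as pd

def apply_to_all(values, func):
    """Call func on every item and collect the results in a list."""
    return [func(v) for v in values]

numbers = [15, 10, 200]

# A built-in can be passed like any other function: str(15) -> '15'
as_text = apply_to_all(numbers, str)
print(as_text)  # ['15', '10', '200']

# A lambda defines a small one-off function right where it is needed
doubled = apply_to_all(numbers, lambda n: n * 2)
print(doubled)  # [30, 20, 400]

# Series.apply follows the same pattern: it calls the function on each value
bills = pd.Series([10.0, 20.0])
doubled_bills = bills.apply(lambda bill: bill * 2)
print(doubled_bills.tolist())  # [20.0, 40.0]
```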

### Try / Except blocks

Code blocks like this:

```
try:
  something
except Exception as e:
  do this if error
```

allow your code to keep executing even if an error occurs!

#### For instance:

In [None]:
# 1 + ' more time'
# print('Done!')

In [None]:
try:
  1 + ' more time'
except Exception as e:
  print('Something went wrong!')
  print('Error', e)
print('Done!')
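
A common use is cleaning a list where some values cannot be converted. The sketch below uses made-up input (`raw_values`) and catches only the two error types that `float()` is expected to raise here, rather than every `Exception`.

```python
raw_values = ['12.5', 'n/a', '7', None]
cleaned = []

for value in raw_values:
    try:
        cleaned.append(float(value))
    except (ValueError, TypeError) as e:
        # ValueError: text that is not a number; TypeError: None is not text or a number
        print(f'Could not convert {value!r}: {e}')
        cleaned.append(None)

print(cleaned)  # [12.5, None, 7.0, None]
```

Naming specific exception types means that an unexpected error, such as a typo in a variable name, still stops the program instead of being silently swallowed.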

### Data Cleaning / Wrangling

#### What is it? / Why?

Cleaning data helps prevent "garbage in, garbage out": unreliable inputs lead to unreliable analysis.

Read more: [Why clean data?](https://www.tableau.com/learn/articles/what-is-data-cleaning)

#### Dealing with missing data

##### How to identify missing data

In [None]:
# If this causes an error try restarting the notebook kernel & running again
!pip install fsspec s3fs &> /dev/null

In [None]:
import pandas as pd
df = pd.read_csv('s3://databyjp/data/titanic_train.csv')

In [None]:
df

In [None]:
df.isna()

In [None]:
df.isna().any()

In [None]:
df.info()

In [None]:
df['Age'].isna()

In [None]:
df[df['Age'].isna()]

In [None]:
df[(df['Age'].isna()) & (df['Pclass'] == 1)]

##### How to clean missing data

In [None]:
# Drop rows
df_cln = df[~df['Age'].isna()]  # ~ inverts a boolean Series. Or, use: df[df['Age'].notna()]

In [None]:
df_cln

In [None]:
# Replace the values with a representative value
df.loc[df['Age'].isna(), 'Age'] = df['Age'].median()

df[df['Age'].isna()]  # Should now return no rows
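
The same steps can be written with the pandas helpers `dropna` and `fillna`. This is a minimal sketch on a hypothetical four-row frame (`passengers`), so it does not depend on the Titanic file being reachable.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages
passengers = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy', 'Di'],
    'Age': [22.0, np.nan, 40.0, np.nan],
    'Pclass': [1, 3, 1, 2],
})

# Identify: count the missing values in each column
missing_per_column = passengers.isna().sum()
print(missing_per_column)

# Option 1: drop the rows where Age is missing
dropped = passengers.dropna(subset=['Age'])

# Option 2: fill missing ages with the median of the known ages
filled = passengers.fillna({'Age': passengers['Age'].median()})
print(filled)
```

Both calls return new DataFrames and leave `passengers` unchanged. Which option is appropriate depends on how much data is missing and how the column will be used.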

### Pair plots & interpretation

Remember what we learned about correlation

![Correlation examples](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/2560px-Correlation_examples2.svg.png)

#### Pair plot example

![Pair plot example](https://seaborn.pydata.org/_images/pairplot_3_0.png)
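
A pair plot is the visual counterpart of a correlation matrix. The sketch below computes the matrix with pandas on made-up data (`measurements`, hypothetical) and includes one column that clearly depends on `x` but has zero linear correlation with it, like the bottom row of the correlation figure above.

```python
import pandas as pd

# Hypothetical measurements chosen to give clear-cut correlations
measurements = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'rises_with_x': [2, 4, 6, 8, 10],
    'falls_with_x': [10, 8, 6, 4, 2],
    'symmetric': [1, 4, 0, 4, 1],  # mirrored around x = 3
})

# Pearson correlation between every pair of columns
corr = measurements.corr()
print(corr.round(2))

# To draw the matching pair plot (requires seaborn):
# import seaborn as sns
# sns.pairplot(measurements)
```

The `symmetric` column shows why plotting matters: its correlation with `x` is zero, yet a scatter plot would reveal an obvious pattern.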

### APIs / Web programming

#### Request / response cycle

![Request/response](https://miro.medium.com/max/1400/1*2UbC5pSRyjGmF1ezB9hvYg.png)

#### HTTP response codes

You might have seen images like this:

![404 example](https://www.howtogeek.com/wp-content/uploads/2018/05/2018-06-03-2.png?trim=1,1&bg-color=000&pad=1,1)

This is a **response status code**:

HTTP response status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped in five classes:

- Informational responses (100–199)
- Successful responses (200–299)
- Redirection messages (300–399)
- Client error responses (400–499)
- Server error responses (500–599)

Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
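
The class of a status code can be read from its first digit. The sketch below uses Python's standard-library `http.HTTPStatus` to look up the standard phrase for a code. The helper `describe_status` and the `CLASSES` mapping are hypothetical names introduced for illustration.

```python
from http import HTTPStatus

# First digit of the code -> response class (from the list above)
CLASSES = {
    1: 'Informational',
    2: 'Successful',
    3: 'Redirection',
    4: 'Client error',
    5: 'Server error',
}

def describe_status(code):
    """Return the response class and standard phrase for an HTTP status code."""
    status_class = CLASSES.get(code // 100, 'Unknown')
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the standard codes known to HTTPStatus
        phrase = 'unrecognised code'
    return f'{code}: {status_class} ({phrase})'

for code in [200, 404, 500]:
    print(describe_status(code))
```

For example, `describe_status(404)` returns `'404: Client error (Not Found)'`.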


## Phase 1 Project

Due 25 June 

(**Week 6 is project week**)

### Before you start the project...

Complete:
- The remainder of the "**Welcome**" section, if you haven't yet, and:
- Go through the "**Appendix: Bash and Git**" section over this week and next

The project will ask you to:
- Create your own Jupyter notebook
- Use git and GitHub by creating your own "repository" for the project, and
- Create a 'non-technical presentation'

#### Suggestions:
- Create a GitHub account if you haven't yet done so
- Follow the tutorials to set up a *local* Python work environment (install git / Anaconda)
- Familiarise yourself with the "Bash" shell - terminal on Mac, or Git Bash on Windows (*NOT Command Prompt*)
- Follow the Git tutorials over the next couple of weeks piece by piece


### Project information

Read the "Online Milestones Instructions" section. But:
- "Blogging" is **optional** i.e. not assessed
- Treat the marking rubrics as a project checklist / guidance

Important: Check out the ["Templates and Examples"](https://github.com/learn-co-curriculum/dsc-project-template) repo

## Lab solutions

Visible through course materials: Click on the GitHub link on each lab page, and check out the "solution" branch in the GH repository.

## Read more

- [Pandas documentation - selecting subsets of data](https://pandas.pydata.org/pandas-docs/version/1.0.2/getting_started/intro_tutorials/03_subset_data.html)
- [Lambda functions](https://towardsdatascience.com/lambda-functions-with-practical-examples-in-python-45934f3653a8)
- [Map v Apply v Applymap](https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff)
- [Pairplots](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
- [Overview of HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview)