## Pandas

In [None]:
import pandas as pd

### Read data file
**pd.read_csv('file_path', delimiter=',', skiprows=n, header=m, index_col=k)**  
- Specify the file path as the first argument.
- Specify the delimiter with the delimiter argument. It can be omitted if the delimiter is a comma (,).
For a tab delimiter use `'\t'`; for whitespace use `'\s+'`.  
The `sep` argument can be used instead of `delimiter`.  
- Specify the number of rows to skip in the skiprows argument. Can be omitted if there is no line to skip.    
- Specify the row number that holds the column labels with the header argument. Use None if the file has no column labels.  
- Specify the column to use as the row labels with the index_col argument. Omit it if there are no row labels.
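
The arguments above can be tried without a file on disk. The following is a minimal sketch: the tab-delimited text in `raw` and the name `df_tab` are made up for illustration, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text: one comment line to skip, then a header row.
raw = "# exported sample\nid\tage\tgender\n1001\t25\tM\n1002\t41\tF\n"

df_tab = pd.read_csv(
    io.StringIO(raw),  # a file path string is passed in the same position
    sep="\t",          # tab delimiter
    skiprows=1,        # skip the first line of the text
    header=0,          # the first remaining line holds the column labels
    index_col=0,       # use the first column ("id") as the row labels
)
print(df_tab)
```

Here `header=0` is counted after the skipped line, so the `id  age  gender` line becomes the column labels.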

※ Check the contents of the file before reading the data file  
Right-click the data file you want to open in JupyterLab, Open With > Editor  

(Note) In Pandas, the data type may be different for each column. (In NumPy, all elements have the same data type)

In [None]:
df = pd.read_csv('data/user_data1.csv',delimiter=',')
display(df.head())

In the following explanation, let df be a DataFrame variable. 

### Condition extraction of DataFrame

**df[condition]**  
Example of how to write a condition  
- `df[df["col_name"] == value1]` Extract rows where the specified column equals value1.
- `df[df["col_name"] > value2]` Extract rows where the specified column is greater than value2.  

Specifying multiple conditions:   
- AND(&): df[(cond_1)&(cond_2)]  
- OR(|): df[(cond_1)|(cond_2)]
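
As a small sketch of the OR form, the example below uses a made-up DataFrame named `people` (not the tutorial's data file). Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
people = pd.DataFrame({
    "age": [22, 35, 58, 64],
    "gender": ["M", "F", "M", "F"],
})

# OR: keep rows where age is under 30 or over 60.
young_or_old = people[(people["age"] < 30) | (people["age"] > 60)]
print(young_or_old)
```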

In [None]:
display(df[df["age"] > 50]) # Show rows where age is over 50

In [None]:
display(df[(df["gender"] == "M")&(df["age"] < 30 )]) # Show rows where gender is M and age is under 30

### Add row to DataFrame

**pd.concat([df, ser.to_frame().T],ignore_index=True)**

- Give the Series to be added an index that matches the column labels of the DataFrame.
- With the argument `ignore_index=True`, the row index is renumbered automatically.
- Convert the Series to a 2-D DataFrame using to_frame(). The result is a single column, so transpose it with ".T" to turn it into a row.
- Combine the resulting one-row DataFrame with the original DataFrame using pd.concat.

In [None]:
# Build the row to add as a Series whose index matches the DataFrame's column labels
ser = pd.Series([1013,20,"M","student"],index=df.columns) 
print(ser)
print(df.columns)

df2 = pd.concat([df, ser.to_frame().T],ignore_index=True)
display(df2)

### Add column to DataFrame

**df["new_col"] = list or Series**

In [None]:
# Add a new column "year" to represent the year of birth
df["year"] = 2020 - df["age"]
display(df)

### Delete rows / columns of DataFrame

Delete columns:  
**df.drop(columns="col_name")**  or  **df.drop("col_name",axis=1)**

Delete rows:  
**df.drop(index=row_num)**  or  **df.drop(row_num)**
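
Row deletion can be sketched as follows, using a made-up DataFrame named `people`. Note that `drop` returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a variable to keep it.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
people = pd.DataFrame({"id": [1001, 1002, 1003], "age": [22, 35, 58]})

dropped_one = people.drop(index=1)       # remove the row with index label 1
dropped_two = people.drop(index=[0, 2])  # remove several rows at once

print(dropped_one)
print(dropped_two)
print(people.shape)  # the original DataFrame still has all of its rows
```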

In [None]:
df = df.drop(columns="year")
display(df)

### Merge DataFrame

pd.merge combines two DataFrames that share common columns into one DataFrame.  
The common column used for the merge is called the "key".

**pd.merge(df1,df2,on="col_name",how="types_of_merge")**

- The DataFrame of the first argument is the table on the left, and the DataFrame of the second argument is the table on the right.
- on: Column or index level names to join on. These must be found in both DataFrames.   
- how: Type of merge to be performed.
```
how="inner": inner join. Keep only the rows whose key appears in both DataFrames
how="outer": outer join. Keep all rows from both DataFrames (missing values become NaN)
how="left": use only keys from the left DataFrame
how="right": use only keys from the right DataFrame
```
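
To compare the four merge types side by side, here is a minimal sketch with two made-up DataFrames, `left` and `right`, that share the key column "id". The names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical tables: ids 2 and 3 appear in both, 1 only on the left, 4 only on the right.
left = pd.DataFrame({"id": [1, 2, 3], "age": [20, 30, 40]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Tokyo", "Osaka", "Nagoya"]})

results = {}
for how in ["inner", "outer", "left", "right"]:
    results[how] = pd.merge(left, right, on="id", how=how)
    # Print which keys survive each type of merge.
    print(how, results[how]["id"].tolist())
```

In the outer, left, and right results, cells with no matching row in the other table are filled with NaN.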

In [None]:
df2 = pd.read_csv('data/user_data2.csv',delimiter=',')
display(df2.head())

# The common column is "id". Merge by outer join.
df_new = pd.merge(df,df2,on="id",how="outer")
display(df_new)

### Group DataFrame

Data can be aggregated for each category group and the total, maximum, minimum, average, etc. can be calculated.

**df.groupby("col_name").statistic_function**

Statistical functions are sum(), max(), min(), mean(), etc.
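
A minimal sketch of grouping, assuming a made-up DataFrame named `people`. Selecting the "age" column after `groupby` restricts the calculation to that one numerical column.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
people = pd.DataFrame({
    "occupation": ["student", "engineer", "student", "engineer"],
    "gender": ["M", "F", "F", "M"],
    "age": [20, 40, 24, 50],
})

# Average and maximum age for each occupation.
mean_age = people.groupby("occupation")["age"].mean()
max_age = people.groupby("occupation")["age"].max()

print(mean_age)
print(max_age)
```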

In [None]:
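# Aggregate data for each occupation and calculate the average value of each numerical column
# numeric_only=True skips non-numerical columns, which recent pandas versions would otherwise reject
df.groupby("occupation").mean(numeric_only=True)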

### Sort

**df.sort_values(by=["col_name1"],ascending=True)**  

In the argument `ascending`, specify True for ascending order and False for descending order for each corresponding column.  

Sort by multiple columns:  
**df.sort_values(by=["col1","col2"],ascending=[True,True])**  
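
A minimal sketch of sorting by two columns with a different order for each, using a made-up DataFrame named `people`. Rows are ordered by "gender" first, and ties are then ordered by "age".

```python
import pandas as pd

# Hypothetical sample data for illustration only.
people = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "age": [30, 25, 22, 41],
})

# gender in ascending order, then age in descending order within each gender.
sorted_people = people.sort_values(by=["gender", "age"], ascending=[True, False])
print(sorted_people)
```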

In [None]:
display(df_new.sort_values(by=["age"],ascending=True)) # Sort by age

### Write to CSV file

**df.to_csv('file_path')**

When the argument `header=False` is added, the column names are not written.    
When the argument `index=False` is added, the row names (index) are not written.   
When the argument `columns=["col1", "col2"]` is added, only the specified columns are written.
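
The effect of these arguments can be checked without creating a file. In this sketch an `io.StringIO` buffer stands in for the file path, and the DataFrame `people` is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data for illustration only.
people = pd.DataFrame({
    "id": [1001, 1002],
    "age": [22, 35],
    "gender": ["M", "F"],
})

buffer = io.StringIO()
# Write only the "id" and "age" columns, without the row index.
people.to_csv(buffer, index=False, columns=["id", "age"])

csv_text = buffer.getvalue()
print(csv_text)
```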

In [None]:
df_new.to_csv('data/new_user_data.csv')