# Pandas1

**[1] Package**<br>
**[2] Pandas data structures**<br>
**[3] Read data from file**<br>
**[4] Sorting**

## [1] Package

In [None]:
!pip install pandas

In [None]:
import pandas as pd

## [2] Pandas data structures
- **Series**: A one-dimensional array-like structure.
- **DataFrame**: A tabular, spreadsheet-like structure.

### [2.1] Series

- **Create a series from a list**

In [None]:
Series1 = pd.Series([4, 7, -5, 3])
Series1

In [None]:
# Get the underlying values as a NumPy array
Series1.values

In [None]:
# Get the index
Series1.index

- **Create a series and assign an index array**

In [None]:
Series2 = pd.Series([4, 7, -5, 3], index = ["a","b","c","d"])
Series2

In [None]:
Series2.index
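Once a Series has a labelled index, its values can be looked up by label as well as by position. The following is a minimal sketch using the same values as `Series2`; the variable names `s`, `by_label`, `by_position` and `subset` are introduced here only for illustration.

```python
import pandas as pd

# Same data as Series2 above, rebuilt so this cell runs on its own
s = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

by_label = s["b"]         # look up a single value by its index label
by_position = s.iloc[2]   # look up a single value by integer position
subset = s[["a", "d"]]    # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```

Using `.iloc` for positional access keeps the intent explicit, which helps when the index labels are themselves integers.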

### [2.2] DataFrame

- **Create a DataFrame from a dictionary**

In [None]:
# Create a dictionary
data = {"state": ["Ohio","Ohio","Ohio","Nevada","Nevada","Nevada"],
        "year":[2000,2001,2002,2001,2002,2003],
        "pop":[1.5,1.7,3.6,2.4,2.9,3.2]}

In [None]:
df1 = pd.DataFrame(data)
df1

- **Create a DataFrame and assign an index array**

In [None]:
df2 = pd.DataFrame(data, index = ["a", "b", "c", "d", "e", "f"])
df2

In [None]:
df2.shape

- **A DataFrame is a collection of Series**

In [None]:
year_series = df2["year"]

print(year_series)

In [None]:
print(type(year_series))

## Exercise.A

**(A.1) Create a dictionary containing three key-value pairs. Use the keys "exam1", "exam2" and "exam3" and specify the following lists as the corresponding values.**

In [None]:
score_list1 = [70, 85, 90, 50, 75]
score_list2 = [80, 65, 85, 60, 80]
score_list3 = [85, 70, 70, 75, 80]

**(A.2) Create a dataframe from the dictionary you obtained in (A.1) with the following list as the index. Display the dataframe.**

In [None]:
ID_list = ['S01', 'S02', 'S03', 'S04', 'S05']

**(A.3) Using the dataframe obtained in (A.2), print the number of rows and columns.**

## [3] Read data from file

### [3.1] Read csv file into Pandas DataFrame

- **Check your notebook’s current working directory**

In [None]:
import os
os.getcwd()

- **Read csv file**

In [None]:
park_df = pd.read_csv('parks.csv')
# or park_df = pd.read_csv('../dataset/parks.csv')
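If `parks.csv` is not in the working directory, `pd.read_csv` raises a `FileNotFoundError`. To try the function without any file on disk, the sketch below reads CSV text from memory instead. The column names and rows are made up for illustration and are not taken from `parks.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real parks.csv may have different columns
csv_text = "Park,State,Acres\nAlpha,CA,100\nBeta,UT,250\n"

# read_csv accepts a file path or any file-like object such as StringIO
demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df)
print(demo_df.shape)
```

The first line of the text is used as the header row, so the resulting DataFrame has two rows and three columns.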

- **View data**

In [None]:
park_df.head()

- **A summary of a DataFrame**

In [None]:
park_df.info()

## Exercise.B

**(B.1) Read the csv file <code>diabetes.csv</code> as a pandas dataframe. Display the first 10 rows.**

• **Pregnancies**: Number of times pregnant<br>
• **Glucose**: Plasma glucose concentration over 2 hours in an oral glucose tolerance test<br>
• **BloodPressure**: Diastolic blood pressure (mm Hg)<br>
• **SkinThickness**: Triceps skin fold thickness (mm)<br>
• **Insulin**: 2-Hour serum insulin (mu U/ml)<br>
• **BMI**: Body mass index (weight in kg / (height in m)^2)<br>
• **DiabetesPedigreeFunction**: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)<br>
• **Age**: Age (years)<br>
• **Outcome**: Class variable (0 if non-diabetic, 1 if diabetic)<br>

**(B.2) How many rows and columns does the dataframe you obtained in (B.1) have?**

**(B.3) Show a summary of the dataframe, including column names and their data types.**

### [3.2] Descriptive statistics

- **Pandas dtype**

|Pandas dtype|Python built-in type|Description|
|:--|:--|:--|
|int64|int|Integer numbers|
|float64|float|Floating point numbers|
|object|str or mixed|Text or mixed numeric and non-numeric values|
|bool|bool|True/False values|
|datetime64[ns]|datetime|Date and time values|
|timedelta64[ns]|--|Differences between two datetimes|
|category|--|Finite list of text values|
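To see how these dtypes show up in practice, the sketch below builds a small DataFrame with one column per kind of value and inspects its `dtypes` attribute. The DataFrame `demo` and its column names are hypothetical. Text columns are typically reported as `object`, although newer pandas releases may report a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical data: one column for each common kind of value
demo = pd.DataFrame({
    "count": [1, 2, 3],           # integers
    "ratio": [0.5, 1.5, 2.5],     # floating point numbers
    "name": ["a", "b", "c"],      # text
    "flag": [True, False, True],  # booleans
})

# dtypes is a Series mapping each column name to its dtype
dtypes = demo.dtypes
print(dtypes)
```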

- **Change data type**

In [None]:
park_df["Acres"] = park_df["Acres"].astype(float)
park_df.info()

In [None]:
# Equivalent to the cell above, using attribute access instead of brackets
park_df.Acres = park_df.Acres.astype(float)
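Note that `astype` returns a new Series and leaves the original untouched, which is why the result is assigned back to the column. A minimal sketch with a made-up `Acres` column (the values are illustrative, not from `parks.csv`):

```python
import pandas as pd

# Hypothetical integer column standing in for park_df["Acres"]
demo = pd.DataFrame({"Acres": [100, 250, 75]})

converted = demo["Acres"].astype(float)  # new Series; demo is unchanged so far
demo["Acres"] = converted                # assign back to update the column

after = demo["Acres"].dtype
print(after)
```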

- **Descriptive statistics of numerical columns**

In [None]:
park_df.describe()

In [None]:
park_df["Acres"].describe()

- **Value count of categorical columns**

In [None]:
park_df.State.value_counts()
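`value_counts` returns a Series whose index holds the distinct values and whose values hold how often each one occurs, sorted from most to least frequent. The sketch below uses a small made-up Series; passing `normalize=True` gives proportions instead of counts.

```python
import pandas as pd

# Hypothetical categorical data
states = pd.Series(["Ohio", "Ohio", "Nevada", "Ohio", "Nevada"])

counts = states.value_counts()                # occurrences of each value
shares = states.value_counts(normalize=True)  # fraction of rows for each value

print(counts)
print(shares)
```

An individual count can be read by label, for example `counts["Ohio"]`.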

## Exercise.C

**(C.1) Use the dataframe obtained in (B.1). Change the data type of <code>Outcome</code> to <code>object</code>.**

**(C.2) Show the descriptive statistics for the <code>Age</code> column of the dataframe.**

**(C.3) The <code>Outcome</code> column indicates whether an individual has diabetes, where 0 represents non-diabetic and 1 represents diabetic. How many individuals in this dataset have diabetes?**

## [4] Sorting

In [None]:
df = pd.DataFrame({"state": ["Ohio","Ohio","Ohio","Nevada","Nevada","Nevada"],
                    "year":[2000,2001,2002,2001,2002,2003],
                    "pop":[1.5,1.7,3.6,2.4,2.9,3.2]})
df

- **Sort a DataFrame by one column**

In [None]:
df.sort_values(by = "year")

- **Sort a DataFrame in descending order**

In [None]:
df.sort_values(by = "year", ascending = False)

- **Sort a DataFrame by multiple columns**

In [None]:
df.sort_values(by = ["state", "year"])

In [None]:
df.sort_values(by = ["state", "year"], ascending = [True, False])

- **Sort a DataFrame in place (inplace = True)**

In [None]:
# Option-1: Store the result in a new variable
df_sorted = df.sort_values(by = "year")
df_sorted

In [None]:
# Option-2: Use inplace = True
df.sort_values(by = "year", inplace = True)
df
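A common pitfall with `inplace = True` is assigning the result to a variable: the method then returns `None` and modifies the original DataFrame instead. The sketch below illustrates this with a hypothetical DataFrame `df_demo`.

```python
import pandas as pd

# Hypothetical unsorted data
df_demo = pd.DataFrame({"year": [2002, 2000, 2001]})

# With inplace=True the method returns None and sorts df_demo itself
result = df_demo.sort_values(by="year", inplace=True)

print(result)   # nothing useful is returned
print(df_demo)  # rows are sorted; the original index labels travel with them
```

Because the original index labels are kept, the index is no longer in order after sorting.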

- **Reset index after sorting**

In [None]:
# Drop the old index
df.reset_index(drop = True)

In [None]:
# Add the old index as an additional column to your DataFrame
df.reset_index(drop = False)
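Like `sort_values`, `reset_index` returns a new DataFrame by default, so the result needs to be stored in a variable to be kept. The sketch below compares both settings of `drop` on a hypothetical sorted DataFrame `df_demo`.

```python
import pandas as pd

# Hypothetical data, sorted so that the index is out of order
df_demo = pd.DataFrame({"year": [2002, 2000, 2001]}).sort_values(by="year")

dropped = df_demo.reset_index(drop=True)   # old index discarded
kept = df_demo.reset_index(drop=False)     # old index stored in a column named "index"

print(dropped)
print(kept)
```

In both cases `df_demo` itself still has its old index, because neither call was made in place.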

## Exercise.D

**(D.1) Use the dataframe obtained in (C.1). Sort the dataframe in ascending order based on the <code>Age</code> column and store the result in a new variable. Display the result.**

**(D.2) Use the dataframe obtained in (D.1). Reset the index and drop the old one.**