# Pandas 

Pandas is the primary library we will use for data analytics. It provides high-level data structures and manipulation tools designed to make data analysis in Python fast and easy.
It can:

- Read data from Excel
- Manipulate data 
- Visualize data
- Filter and aggregate data


In this part we cover:

- [Reading Files](#Reading-CSV)
- [Viewing Data](#Viewing-Data)
- [Selection](#Selection)
- [Slicing and Indexing](#Slicing-and-Indexing)
- [Built-in Functions for Summary Statistics](#Built-in-Functions)
- [Add and Delete Columns](#Add-and-Delete-Columns)


## Reading CSV

To use `pandas`, we first need to import it. It is a third-party library, not part of the Python standard library, so it also has to be installed.
The syntax is `import pandas as pd` where `pd` is the alias for `pandas`.
After that, we can use `pd` to call functions from `pandas`.

Make sure the file 'Grades_Short.csv' is in the right directory.
A CSV (Comma Separated Values) file is a plain text file that contains a list of data.
Excel can open it as a spreadsheet, but you can also view it in Notepad or any other text editor.

Open the CSV file to see what it looks like.

In [1]:
import pandas as pd

# Read the CSV file into a pandas dataframe. header=0 treats the first row (remember the index starts at 0) as the header row.
df_grade = pd.read_csv("Data/Grades_Short.csv", header=0)
# DataFrame
df_grade

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62
5,Ningyuan,31.0,1,19.0,19,1,8.0,24.0,B,87452
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374


- Before using `pandas`, you have to *import* it first.
- Make sure the file path is correctly specified. Checking `pwd` is always a good idea.
- `Data/Grades_Short.csv` specifies a relative path: there is a folder called **"Data"** in the working directory (use `pwd` to check) and `Grades_Short.csv` is in that folder. The path separator depends on the OS: `\` on Windows, `/` on macOS and Linux. Python accepts `/` on all of them.
- What should we do if the file is in the parent directory? Hint: use `..`
- `read_csv` is a function included in `pandas`.
- `df_grade` is a pandas **dataframe**. Like a list, it is a data structure, but a more elaborate one: it stores a data set as a table with labeled rows and columns, much like an Excel sheet.
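The points above can be tried without any file on disk. The following is a minimal sketch that assumes a three-column excerpt of the grades data held in a string; the name `df_demo` and the reduced set of columns are chosen for illustration only.

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for the CSV file: three columns and three rows of the grades data.
csv_text = """Name,Final,Grade
Liam,33.0,A
Olivia,32.0,A
Noah,33.0,A-
"""

# read_csv accepts a file path or any file-like object.
df_demo = pd.read_csv(StringIO(csv_text), header=0)
print(df_demo)

# With a real file, only the path changes, for example:
#   pd.read_csv("Data/Grades_Short.csv")   # file inside the Data folder
#   pd.read_csv("../Grades_Short.csv")     # file in the parent directory
```

Because `header=0` is also the default, leaving it out should give the same result here.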

In [None]:
pd.read_csv?

## Viewing Data

The `.head()` method of a dataframe is a convenient way to view the data.

One can specify the number of rows to display by using `.head(n)`.

Similarly, `.tail()` shows the last few rows.

In [3]:
df_grade.head(5)

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62


In [5]:
last_3 = df_grade.tail(3)
last_3

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62
5,Ningyuan,31.0,1,19.0,19,1,8.0,24.0,B,87452
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374


*(Tips)*: Run `head` and `tail` to see if the data has been read correctly.

An important feature of a dataset is its **column names**, or **header**. We can check them with the `columns` attribute.

In [9]:
#There are column names
list(df_grade.columns)

['Name',
 'Previous_Part',
 'Participation1',
 'Mini_Exam1',
 'Mini_Exam2',
 'Participation2',
 'Mini_Exam3',
 'Final',
 'Grade',
 'ID']

Why do some names end with `()` (like `head()`) while others don't (like `columns`)?

It is not hard to remember them individually once you get used to it. In general, those with `()` are **methods**, which perform some computation; those without `()` are **attributes**, which store information about the dataframe.
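A short sketch of the difference, using a small hand-built dataframe (`df_demo` is an illustrative name, not part of the course data):

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Liam", "Olivia", "Noah"], "Final": [33.0, 32.0, 33.0]})

# Attributes: stored information, accessed without parentheses.
print(df_demo.shape)          # (3, 2)
print(list(df_demo.columns))  # ['Name', 'Final']

# Methods: functions attached to the dataframe, called with parentheses.
print(df_demo.head(2))

# Without the parentheses you get the method object itself, not its result.
print(callable(df_demo.head))  # True
```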

We can check the data type of each column. `read_csv` infers the types automatically, but it is always a good idea to check.

In [6]:
df_grade.dtypes

Name               object
Previous_Part     float64
Participation1      int64
Mini_Exam1        float64
Mini_Exam2          int64
Participation2      int64
Mini_Exam3        float64
Final             float64
Grade              object
ID                  int64
dtype: object

One can also show the row labels (the index). By default this is just a range of integers.

In [10]:
#And there are row names
df_grade.index

RangeIndex(start=0, stop=7, step=1)

It is always good to check the **size** of the dataframe using `.shape`.

In [6]:
#Get the dimensions of the data frame with shape
df_grade.shape

# num_row, num_col = df_grade.shape

# num_row, num_col

(7, 10)

## Selection

Now that we have some basic information about the whole dataset, we can dive into individual subjects (rows) and variables (columns).

We can get a specific column with `df['column name']`. This is similar to querying a key in a dictionary.

In [12]:
df_grade['Name'].tail(3)
# df_grade.Name

4      Amelia
5    Ningyuan
6        Otto
Name: Name, dtype: object

In [15]:
df_grade.Name 

0        Liam
1      Olivia
2        Noah
3        Emma
4      Amelia
5    Ningyuan
6        Otto
Name: Name, dtype: object

- Remember to put the column name in quotes (`""` or `''`), because it is a string.
- Use square brackets.
- The result is a **series**, effectively a dataframe with one column. So one can follow it with `head()`, for example.
- An equivalent way is to use `.` followed by the column name. This is similar to an *attribute* of the dataframe, so no quotation marks are needed.

In [21]:
#You can similarly pick out columns as attributes with the '.'
df_grade.Grade.head() # .head() shows the first five rows

0     A
1     A
2    A-
3     A
4     A
Name: Grade, dtype: object

*(Tips)*: What are the pros and cons of using `[]` versus `.` to select a column?

- Con of `[]`: more typing...
- Pros of `[]`: can select multiple columns; can create a new column; can select columns with spaces in the name...
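The trade-off can be seen in a small sketch. The column name `Final Score` (with a space) and the column `Passed` are hypothetical, made up to show the cases where only square brackets work.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Liam", "Olivia"], "Final Score": [33.0, 32.0]})

# Dot access works only when the column name is a valid Python identifier.
names = df_demo.Name

# A name containing a space needs square brackets.
scores = df_demo["Final Score"]

# Creating a column also needs square brackets.
df_demo["Passed"] = scores > 30

# A list of names inside the brackets selects several columns at once.
subset = df_demo[["Name", "Passed"]]
print(subset)
```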

We can **select multiple columns** similarly.

- Don't forget the [] inside. We provide a list of column names.
- The outcome is still a dataframe.

In [13]:
df_grade[["Name", "Grade"]]

# df_grade.Name.Grade does not work: dot access cannot select several columns

Unnamed: 0,Name,Grade
0,Liam,A
1,Olivia,A
2,Noah,A-
3,Emma,A
4,Amelia,A
5,Ningyuan,B
6,Otto,A


When you pick out a single column, as we have done above, the result is a **series**, which is essentially a one-dimensional dataframe. One can store it in a separate variable for further analysis and index it much like a list.

In [15]:
name_column = df_grade["Name"]
name_column.head()

0      Liam
1    Olivia
2      Noah
3      Emma
4    Amelia
Name: Name, dtype: object

In [16]:
# We index and slice a series through the index
# name_column[0:3] # like a list
name_column[[1,2,4]] # provide a list of indices

1    Olivia
2      Noah
4    Amelia
Name: Name, dtype: object

*(Exercise)*: 

- Select the first 5 rows of the column 'Grade'.
- Select the second-to-last row of the columns 'Grade' and 'Name'.
- Check if all students have participated in Exam1 (check if all the entries in Participation1 are 1). 

## Slicing and Indexing

While selection allows us to pick an entire column, in some cases we want to access a subset of rows or columns.

#### Using Labels

The first method is `.loc`, which picks rows and columns by their labels.

In [17]:
#Let's look at the data 
df_grade.head()

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62


In [19]:
#Pick out a single entry
df_grade.loc[3,"Name"]   # "Emma"; not displayed, since a cell only shows the value of its last line
df_grade.loc[1,"Final"]

32.0

What does the code above do?

- `.loc` takes two arguments: the label of the row (numbers, returned by `.index`) and the label of the column (strings, returned by `.columns`).
- Use `:` to pick the entire row/column. The result is a series.

In [21]:
df_grade.loc[0,:]

Name               Liam
Previous_Part      32.0
Participation1        1
Mini_Exam1         19.5
Mini_Exam2           20
Participation2        1
Mini_Exam3         10.0
Final              33.0
Grade                 A
ID                90743
Name: 0, dtype: object

In [22]:
df_grade.loc[0,:].reset_index()

Unnamed: 0,index,0
0,Name,Liam
1,Previous_Part,32.0
2,Participation1,1
3,Mini_Exam1,19.5
4,Mini_Exam2,20
5,Participation2,1
6,Mini_Exam3,10.0
7,Final,33.0
8,Grade,A
9,ID,90743


- One can specify a range using `:`. **Unlike list slicing, the endpoint is included!** This is because `.loc` refers to rows and columns by their labels, not by their positions.

In [32]:
#Select contiguous rows and columns 
df_grade.loc[1:3, "Mini_Exam3":"Grade"]

Unnamed: 0,Mini_Exam3,Final,Grade
1,14.0,32.0,A
2,10.5,33.0,A-
3,13.0,34.0,A


- One can also provide a list of labels for the slicing.

In [20]:
#Select non-contiguous rows and columns
df_grade.loc[[0,2,4], ["Previous_Part","Grade"]]

Unnamed: 0,Previous_Part,Grade
0,32.0,A
2,30.0,A-
4,30.0,A


#### Using Positions

The second method is `.iloc`. Notice the different behavior regarding the endpoint.

In [34]:
df_grade.iloc[1:3, 1:2]

Unnamed: 0,Previous_Part
1,32.0
2,30.0
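The endpoint behaviour of `.loc` and `.iloc` is easiest to compare when row labels and row positions differ. The sketch below assumes a small dataframe with hypothetical row labels 10, 20, 30 and 40.

```python
import pandas as pd

# Hypothetical row labels 10..40, chosen so that labels and positions differ.
df_demo = pd.DataFrame(
    {"Name": ["Liam", "Olivia", "Noah", "Emma"], "Final": [33.0, 32.0, 33.0, 34.0]},
    index=[10, 20, 30, 40],
)

# .loc slices by label and includes the endpoint: labels 20 and 30.
by_label = df_demo.loc[20:30, "Name"]

# .iloc slices by position and excludes the endpoint: positions 1 and 2.
by_position = df_demo.iloc[1:3, 0]

print(by_label.tolist())     # ['Olivia', 'Noah']
print(by_position.tolist())  # ['Olivia', 'Noah']
```

Both calls return the same two rows, but they get there through different rules: label `30` is included, while position `3` is not.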


In [36]:
# df_grade.iloc[1:3, "Mini_Exam3":"Grade"] # fails: .iloc does not accept string labels

Sometimes it is convenient to use this method to pick a single cell directly.

In [41]:
# df_grade.iloc[1,1]

df_grade.loc[1,"Previous_Part"]

32.0

#### Using Boolean

Sometimes we want to pick the subjects (rows) that satisfy certain conditions.
This is similar to filtering a list, except that the condition is written directly inside the square brackets.

In [21]:
df_grade

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62
5,Ningyuan,31.0,1,19.0,19,1,8.0,24.0,B,87452
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374


In [22]:
df_good_final = df_grade[df_grade['Final'] > 32] # equivalent to df_grade.loc[df_grade.Final > 32, :]
df_good_final.head()
# df_grade.Final > 32 returns a boolean series that is True for rows 0, 2, 3, 4 and 6,
# so the filter selects the same rows as df_grade.loc[[0, 2, 3, 4, 6], :]

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374
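To see what happens inside the square brackets, it can help to store the condition in a variable first. A minimal sketch on a three-row excerpt (`df_demo` and `mask` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Liam", "Olivia", "Noah"], "Final": [33.0, 32.0, 33.0]}
)

# The comparison is evaluated row by row and yields a boolean series.
mask = df_demo["Final"] > 32
print(mask.tolist())  # [True, False, True]

# Passing the mask in square brackets keeps the rows where it is True.
print(df_demo[mask])

# Summing a boolean series counts the True values.
print(mask.sum())  # 2
```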


In [26]:
df_grade[df_grade['Grade'] == 'A'][['Name', 'Grade', 'ID']]

Unnamed: 0,Name,Grade,ID
0,Liam,A,90743
1,Olivia,A,7284
3,Emma,A,1237
4,Amelia,A,62
6,Otto,A,9374


A useful method for filtering rows by a categorical variable is `isin()`, which checks whether each value belongs to a given list.

In [27]:
# print(df_grade["Grade"])
df_grade[df_grade["Grade"].isin(["A", "A-"])] # explain what this line does
# df_grade[df_grade.Grade.isin(["A", "A-"])]


Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374


*(Exercise)*:

- Print the row of a student named "Ningyuan".
- Show all the rows where Mini_Exam1 is greater than or equal to 20 **and** Mini_Exam2 is greater than or equal to 15. 

In [28]:
df_grade[df_grade['Name'] == 'Ningyuan']

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
5,Ningyuan,31.0,1,19.0,19,1,8.0,24.0,B,87452


In [33]:
df_grade[(df_grade['Mini_Exam1'] >= 20) & (df_grade['Mini_Exam2'] >= 15)]
# use | in place of & for "or"

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374


In [None]:
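# Chaining two filters also works, but pandas warns that the second mask is reindexed; combining the conditions with & is cleaner
df_grade[df_grade['Mini_Exam1'] >= 20][df_grade['Mini_Exam2'] >= 15]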
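Conditions are combined with `&` (and), `|` (or) and `~` (not) rather than the Python keywords `and`, `or` and `not`. When a comparison is written inline, it needs its own parentheses, because `&` and `|` bind more tightly than `>=`. A small sketch on three rows of the data (the variable names are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["Liam", "Olivia", "Emma"],
        "Mini_Exam1": [19.5, 20.0, 22.0],
        "Mini_Exam2": [20, 16, 13],
    }
)

high1 = df_demo["Mini_Exam1"] >= 20
high2 = df_demo["Mini_Exam2"] >= 15

both = df_demo[high1 & high2]    # and
either = df_demo[high1 | high2]  # or
not_high1 = df_demo[~high1]      # not

print(both["Name"].tolist())       # ['Olivia']
print(either["Name"].tolist())     # ['Liam', 'Olivia', 'Emma']
print(not_high1["Name"].tolist())  # ['Liam']
```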

## Built-in Functions

`pandas` provides a range of useful functions to get summary statistics. We will go over the most popular ones.

#### Average

The average of a numeric column can be computed using `.mean()`.

In [38]:
df_grade.mean(numeric_only=True)

Previous_Part        31.071429
Participation1        1.000000
Mini_Exam1           19.785714
Mini_Exam2           17.857143
Participation2        1.000000
Mini_Exam3           11.000000
Final                32.214286
ID                29111.000000
dtype: float64

In [39]:
#Compute mean of Final column
avg_final = df_grade["Final"].mean()
avg_final

32.214285714285715

We can also do this for the whole dataset. It is not encouraged, though, because the non-numerical columns have to be left out first (here with `numeric_only=True`).

In [40]:
df_grade.mean(numeric_only=True).reset_index()

Unnamed: 0,index,0
0,Previous_Part,31.071429
1,Participation1,1.0
2,Mini_Exam1,19.785714
3,Mini_Exam2,17.857143
4,Participation2,1.0
5,Mini_Exam3,11.0
6,Final,32.214286
7,ID,29111.0


#### Range

For numerical columns, we can query the `max` and `min`.

In [41]:
print(df_grade.Mini_Exam1.max())
print(df_grade.Mini_Exam1.min())

22.0
19.0


#### Convert Types

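We can convert the type of a column using the **astype()** method. It returns a converted copy and leaves the original column unchanged, as the output of the cell shows.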

In [7]:
print(df_grade.Mini_Exam2.dtype)
print(df_grade.Mini_Exam2.astype("float"))
print(df_grade.Mini_Exam2)

int64
0    20.0
1    16.0
2    19.0
3    13.0
4    17.0
5    19.0
6    21.0
Name: Mini_Exam2, dtype: float64
0    20
1    16
2    19
3    13
4    17
5    19
6    21
Name: Mini_Exam2, dtype: int64
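Since `astype()` does not modify the dataframe, the converted column has to be assigned back if the change should persist. A minimal sketch with the first three Mini_Exam2 scores:

```python
import pandas as pd

df_demo = pd.DataFrame({"Mini_Exam2": [20, 16, 19]})

# astype returns a converted copy; the dataframe itself is unchanged.
converted = df_demo["Mini_Exam2"].astype("float")
print(converted.dtype)              # float64
print(df_demo["Mini_Exam2"].dtype)  # still an integer type

# Assign the result back to make the change stick.
df_demo["Mini_Exam2"] = df_demo["Mini_Exam2"].astype("float")
print(df_demo["Mini_Exam2"].dtype)  # float64
```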


#### Unique Values

The average and range don't apply to categorical variables such as *Grade* in this example. The **unique()** method returns an array (think of it as a list) of the unique values in the column.

In [44]:
#Let's look at how many unique grades there were
list_grades = df_grade["Grade"].unique()

list_grades # a NumPy array, which behaves much like a list

list(list_grades) 

['A', 'A-', 'B']

In [45]:
#We can slice list_grades just like a list
list_grades[1:3]

array(['A-', 'B'], dtype=object)

The **value_counts()** method returns the counts of each unique value in the column as a series.

In [46]:
grade_breakdown = df_grade["Grade"].value_counts()
grade_breakdown

A     5
A-    1
B     1
Name: Grade, dtype: int64

In [47]:
num_of_As = df_grade["Grade"].value_counts()['A']
num_of_As

5

In [49]:
len(df_grade[df_grade['Grade'] == 'A'])

5
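Counts are often easier to read as shares of the total. As a sketch, the `normalize=True` option of `value_counts()` returns fractions instead of counts; the series below reproduces the seven grades from the dataset.

```python
import pandas as pd

grades = pd.Series(["A", "A", "A-", "A", "A", "B", "A"], name="Grade")

counts = grades.value_counts()                # absolute counts
shares = grades.value_counts(normalize=True)  # fractions that sum to 1

print(counts["A"])            # 5
print(round(shares["A"], 3))  # 0.714
```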

We can apply any of these built-in functions to multiple columns.

In [65]:
#applying function to multiple columns
df_grade[["Final", "Mini_Exam3"]].mean()

Final         32.214286
Mini_Exam3    11.000000
dtype: float64

As you can see, the end result is a series, where the column names become the index of the series. The **describe()** method gives you key stats (as a dataframe) for every numeric column.

In [50]:
#Using the describe() method
summary = df_grade.describe()
summary
# summary.round(2)

Unnamed: 0,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,ID
count,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0
mean,31.071429,1.0,19.785714,17.857143,1.0,11.0,32.214286,29111.0
std,0.838082,0.0,1.074598,2.734262,0.0,2.217356,3.828154,41131.08167
min,30.0,1.0,19.0,13.0,1.0,8.0,24.0,62.0
25%,30.5,1.0,19.0,16.5,1.0,9.5,32.5,4260.5
50%,31.0,1.0,19.5,19.0,1.0,10.5,33.0,7625.0
75%,31.75,1.0,20.0,19.5,1.0,12.75,33.75,48413.0
max,32.0,1.0,22.0,21.0,1.0,14.0,36.0,90743.0


We can index and slice the above dataframe like any other dataframe.

In [51]:
#slicing summary dataframe
summary.loc[["min", "max"], ["Final", "Previous_Part"]]

Unnamed: 0,Final,Previous_Part
min,24.0,30.0
max,36.0,32.0


#### Sorting

Let's see how we can sort a dataframe.
We can rearrange the rows based on the sorted value of a column.

In [58]:
# df_grade.sort_values(by = "Final")
# df_grade.sort_values(by = ["Final"])
# df_grade.sort_values(by = ["Final", "ID"])
# df_grade.sort_values(by = ["Final", "ID"], ascending=False)
# df_grade.sort_values(by = ["Final", "ID"], ascending=[False, True])

df_grade_copy = df_grade.head(10).copy()  # explicit copy, so that sorting in place does not raise a warning or touch df_grade
df_grade_copy.sort_values(by = ["Final", "ID"], ascending=[False, True], inplace=True)
df_grade_copy

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
5,Ningyuan,31.0,1,19.0,19,1,8.0,24.0,B,87452


In [None]:
df_grade.sort_values?

- `by=`: the name of the column (or a list of names) to sort by.
- `inplace=True` means we change the dataframe directly. With `inplace=False` (the default), the dataframe is not changed and the sorted result is returned, so one can assign it to another variable.
- `ascending=False` means we sort in decreasing order of the column.
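The effect of `inplace` can be checked on a small example. This is a sketch on three rows of the data; `df_demo`, `sorted_copy` and `result` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Liam", "Olivia", "Otto"], "Final": [33.0, 32.0, 36.0]}
)

# Default (inplace=False): a sorted copy is returned, df_demo keeps its order.
sorted_copy = df_demo.sort_values(by="Final", ascending=False)
print(sorted_copy["Name"].tolist())  # ['Otto', 'Liam', 'Olivia']
print(df_demo["Name"].tolist())      # ['Liam', 'Olivia', 'Otto']

# inplace=True: df_demo itself is reordered and the call returns None.
result = df_demo.sort_values(by="Final", ascending=False, inplace=True)
print(result)                        # None
print(df_demo["Name"].tolist())      # ['Otto', 'Liam', 'Olivia']
```

A common slip is to write `df = df.sort_values(..., inplace=True)`, which replaces the dataframe with `None`.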

Now let's sort by multiple columns. Specifying more than one column essentially specifies a tie-break rule.

In [None]:
#Sort by Mini Exam 1 and tie break with Previous Part

result_sorted = df_grade.sort_values(by = ["Mini_Exam1", "Previous_Part"], inplace =False, ascending=[False, True])
result_sorted.head()


#### Other Useful Functions

- The `corr()`/`cov()` methods return the pairwise correlation/covariance between the numeric columns of a dataframe.
- The `count()` method returns the number of non-null values in each column.
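A brief sketch of both methods. The dataframe below takes four rows of the data and deliberately replaces one Final score with a missing value (an assumption made only to show how `count()` treats it).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["Liam", "Olivia", "Noah", "Emma"],
        "Mini_Exam1": [19.5, 20.0, 19.0, 22.0],
        "Final": [33.0, 32.0, np.nan, 34.0],  # one value removed on purpose
    }
)

# count() skips missing values, so Final reports 3 rather than 4.
counts = df_demo.count()
print(counts)

# numeric_only=True leaves out the text column Name.
corr_matrix = df_demo.corr(numeric_only=True)
print(corr_matrix)
```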

*(Exercise)*:

- What is the average of Mini_Exam1 and Mini_Exam2?
- What is the standard deviation of Final (use the std method)?
- Sort the dataframe alphabetically by the column 'Name'.
- What is the average of Final for grade 'A' students?

## Add and Delete Columns
Next, we look at how to create new columns.

In [59]:
#Create a New Column that is a function of other columns
df_grade["Final_Perc"] = df_grade["Final"]/35
df_grade.head()

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID,Final_Perc
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743,0.942857
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284,0.914286
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625,0.942857
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237,0.971429
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62,0.957143


In [60]:
#Then delete it with the drop method
df_grade.drop(["Final_Perc"], inplace = True, axis=1)
df_grade.head()

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62


In [63]:
summary = df_grade.describe()
summary
summary.drop(["min", "max"], axis=0)


Unnamed: 0,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,ID
count,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0
mean,31.071429,1.0,19.785714,17.857143,1.0,11.0,32.214286,29111.0
std,0.838082,0.0,1.074598,2.734262,0.0,2.217356,3.828154,41131.08167
25%,30.5,1.0,19.0,16.5,1.0,9.5,32.5,4260.5
50%,31.0,1.0,19.5,19.0,1.0,10.5,33.0,7625.0
75%,31.75,1.0,20.0,19.5,1.0,12.75,33.75,48413.0


The `axis` argument works as follows:

- `axis=1`: delete the given columns.
- `axis=0`: delete the given rows.

Let's look at an example where we delete rows.

In [64]:
#Delete rows with index 0 and 2
drop_rows = df_grade.drop([0,2], inplace = False, axis=0)
drop_rows.head()

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62
5,Ningyuan,31.0,1,19.0,19,1,8.0,24.0,B,87452
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374
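As an alternative to remembering which axis is which, `drop` also accepts the keyword arguments `columns=` and `index=`. A short sketch on three rows of the data (the variable names are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["Liam", "Olivia", "Noah"],
        "Final": [33.0, 32.0, 33.0],
        "Grade": ["A", "A", "A-"],
    }
)

# Same as drop(["Grade"], axis=1)
no_grade = df_demo.drop(columns=["Grade"])

# Same as drop([0, 2], axis=0)
only_olivia = df_demo.drop(index=[0, 2])

print(list(no_grade.columns))        # ['Name', 'Final']
print(only_olivia["Name"].tolist())  # ['Olivia']
```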


*(Exercise)*:

- Add a column called "Mini_Exam1_2" that is the sum of Mini_Exam1 and Mini_Exam2.

In [66]:
df_grade['Mini_Exam1_2'] = df_grade['Mini_Exam1'] + df_grade['Mini_Exam2']
df_grade

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,ID,Mini_Exam1_2
0,Liam,32.0,1,19.5,20,1,10.0,33.0,A,90743,39.5
1,Olivia,32.0,1,20.0,16,1,14.0,32.0,A,7284,36.0
2,Noah,30.0,1,19.0,19,1,10.5,33.0,A-,7625,38.0
3,Emma,31.0,1,22.0,13,1,13.0,34.0,A,1237,35.0
4,Amelia,30.0,1,19.0,17,1,12.5,33.5,A,62,36.0
5,Ningyuan,31.0,1,19.0,19,1,8.0,24.0,B,87452,38.0
6,Otto,31.5,1,20.0,21,1,9.0,36.0,A,9374,41.0


### Write to a CSV file

Finally, once you finish analyzing the dataframe, you can write it back to a CSV file.

In [None]:
#Here is how we write a dataframe
df_grade.to_csv("grade_new.csv")
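By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read again. The sketch below writes to in-memory buffers instead of files so that it leaves nothing on disk; passing `index=False` is one way to avoid the extra column.

```python
from io import StringIO

import pandas as pd

df_demo = pd.DataFrame({"Name": ["Liam", "Olivia"], "Final": [33.0, 32.0]})

# Default: the index is written as an unnamed first column.
with_index = StringIO()
df_demo.to_csv(with_index)
print(with_index.getvalue().splitlines()[0])     # ,Name,Final

# index=False: only the data columns are written.
without_index = StringIO()
df_demo.to_csv(without_index, index=False)
print(without_index.getvalue().splitlines()[0])  # Name,Final

# Reading the first version back yields an extra "Unnamed: 0" column.
with_index.seek(0)
columns_read_back = list(pd.read_csv(with_index).columns)
print(columns_read_back)                         # ['Unnamed: 0', 'Name', 'Final']
```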

## Excel Files (.xlsx)

We can use `pd.read_excel` to read a workbook and the dataframe's `to_excel` method to write one; both accept a sheet name. You can use string formatting to build the name of the correct sheet. Let's say we want to read in the workbook titled "Excel_Reading.xlsx" and add averages at the end of each column.

In [80]:
#Read in an excel file
df1 = pd.read_excel("Data/Excel_Reading.xlsx", "Sheet1")
df1


Unnamed: 0,1000,5000,10000
0,0.030399,0.023484,0.023118
1,0.02369,0.022857,0.023055
2,0.029339,0.02326,0.022905
3,0.027113,0.025012,0.023738
4,0.026462,0.023854,0.023733
5,Not avail,0.024979,0.023728
6,0.028853,0.023676,0.022813
7,0.024345,0.022929,0.023441
8,,0.023021,0.022286
9,0.025326,0.024034,0.023799
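The sheet contains a text entry ("Not avail") and an empty cell, so the first column cannot be averaged as it stands. One possible approach, sketched here on four values copied from that column (rows 4, 5, 8 and 9), is to convert the column with `pd.to_numeric` and then append the averages as an extra row. The row label `"Average"` is an arbitrary choice.

```python
import pandas as pd

# Four entries of the first column, including the text and the empty cell.
df_demo = pd.DataFrame({1000: [0.026462, "Not avail", None, 0.025326]})

# errors="coerce" turns anything that is not a number into NaN.
df_demo[1000] = pd.to_numeric(df_demo[1000], errors="coerce")

# mean() skips NaN by default, so only the two valid readings are averaged.
averages = df_demo.mean()

# Add the averages as a final row labelled "Average".
df_demo.loc["Average"] = averages
print(df_demo)
```

Whether unavailable readings should be skipped or handled differently depends on the analysis; skipping them is just the simplest option.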


In [None]:
#We can write the file with to_excel. We can specify a start row and column
df1.to_excel("NewFile.xlsx", sheet_name="Sheet1", startrow=5, startcol=5)

In [84]:
# Caution: this overwrites the whole workbook with a single sheet named Sheet4
df_grade.to_excel("Data/Excel_Reading.xlsx", sheet_name="Sheet4", startrow=0, startcol=0)
