# Collect Data From CSV Files

In this tutorial, we retrieve data from a .csv file.

### What is CSV?

* A CSV (comma-separated values) file is a plain-text file in which each line is one record and the values within a record are separated by commas.
* CSV files are widely used to exchange tabular data between spreadsheets and databases.
* CSV files are convenient because many programs can read and write them.
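
To make the format concrete, here is a minimal sketch that parses a two-line CSV string with Python's built-in csv module. The names and values in csv_text are invented for illustration and do not come from the project data. It also shows that a value containing a comma must be wrapped in double quotes.

```python
import csv
import io

# Hypothetical CSV text, invented for illustration only.
# The second line quotes "Doe, Jane" so its comma is not treated as a separator.
csv_text = 'name,city\n"Doe, Jane",Boston\n'

# io.StringIO lets the csv reader treat a string like an open file
rows = list(csv.reader(io.StringIO(csv_text)))

print(rows)
```

Each line becomes a list of strings, one string per comma-separated value.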



We use a Python package, pandas, to work with CSV files (and many other file types). pandas provides a data structure called DataFrame, which holds the contents of a CSV file as a table with rows and columns. By default, the first line of the CSV file becomes the column names, and each remaining line becomes a row. By convention, we import pandas under the alias pd to keep the code short.
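
As a small sketch of how the header line maps to column names, the example below reads CSV text from memory instead of from a file. The text in csv_text and the variable names are assumptions made up for this illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text, invented for illustration only
csv_text = """city,temperature
Boston,21
Denver,18
"""

# io.StringIO lets read_csv treat a string like a file on disk
sample_df = pd.read_csv(io.StringIO(csv_text))

column_names = list(sample_df.columns)  # taken from the first line
n_rows, n_cols = sample_df.shape        # remaining lines become rows

print(column_names)
print(n_rows, n_cols)
```

The first line supplies the two column names, and the two remaining lines become the two rows of the DataFrame.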

In [None]:
import pandas as pd

Once the pandas package is imported, we can start to use the functions included in this package. 

The function that reads a CSV file is pd.read_csv(path_to_file). It returns a DataFrame, which by convention we name df. To check that the data was read in successfully, we usually print the first 5 rows with .head().

The project contains two CSV files: simpsons_paradox_covid.csv and HELPfull.csv. Let's start with simpsons_paradox_covid.csv.

In [None]:
df = pd.read_csv('data/simpsons_paradox_covid.csv')
df.head()

,Unnamed: 0,age_group,vaccine_status,outcome
0,1,under 50,vaccinated,death
1,2,under 50,vaccinated,death
2,3,under 50,vaccinated,death
3,4,under 50,vaccinated,death
4,5,under 50,vaccinated,death


The data has been read in with four columns: ['Unnamed: 0', 'age_group', 'vaccine_status', 'outcome']. The leftmost column of the output (0, 1, 2, ...) is the DataFrame's row index, not a column of the data. By default, .head() returns the first 5 rows, displayed below the column header. To print more rows, pass the number you want, for example .head(10) for the first 10 rows.

In [None]:
df.head(10)

,Unnamed: 0,age_group,vaccine_status,outcome
0,1,under 50,vaccinated,death
1,2,under 50,vaccinated,death
2,3,under 50,vaccinated,death
3,4,under 50,vaccinated,death
4,5,under 50,vaccinated,death
5,6,under 50,vaccinated,death
6,7,under 50,vaccinated,death
7,8,under 50,vaccinated,death
8,9,under 50,vaccinated,death
9,10,under 50,vaccinated,death
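
The behavior of .head() can be checked on a small table. The sketch below uses a hypothetical 8-row DataFrame (toy_df and its contents are invented for illustration) and counts how many rows each call returns.

```python
import pandas as pd

# Hypothetical 8-row table, invented for illustration only
toy_df = pd.DataFrame({"id": range(1, 9), "label": list("abcdefgh")})

first_five = toy_df.head()    # default: first 5 rows
first_three = toy_df.head(3)  # first 3 rows
all_rows = toy_df.head(10)    # more rows requested than exist: returns all 8

print(len(first_five), len(first_three), len(all_rows))
```

Asking for more rows than the table contains is not an error; .head() simply returns every row.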


We can use .info() to get an overview of the data we read in: the number of rows, the columns, their data types, and how many non-null values each column holds.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268166 entries, 0 to 268165
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Unnamed: 0      268166 non-null  int64 
 1   age_group       268166 non-null  object
 2   vaccine_status  268166 non-null  object
 3   outcome         268166 non-null  object
dtypes: int64(1), object(3)
memory usage: 8.2+ MB


The output shows that this table has 4 columns and 268166 rows.
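
The same facts can also be read from individual attributes when you need them as values rather than as printed text. This is a minimal sketch on a hypothetical table (toy_df is invented for illustration); .shape, .dtypes, and .isna() are standard pandas members.

```python
import pandas as pd

# Hypothetical table with one missing value, invented for illustration only
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "group": ["a", "b", None],
})

n_rows, n_cols = toy_df.shape            # (number of rows, number of columns)
missing_per_column = toy_df.isna().sum() # count of missing values per column

print(n_rows, n_cols)
print(toy_df.dtypes)
print(missing_per_column)
```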

To write a DataFrame to a CSV file, we use .to_csv(path_to_file). Caution: if the file does not exist, a new file is created; if it already exists, it is overwritten!

In [None]:
df.to_csv('data/simpsons_paradox_covid_new.csv')
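
One detail worth knowing: by default, .to_csv() also writes the row index as an extra first column. When such a file is read back, pandas labels that column 'Unnamed: 0', which is a likely (though not confirmed) origin of the 'Unnamed: 0' column in the tutorial data. The sketch below shows the difference using a hypothetical table and temporary files; toy_df and the file names are invented for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical table, invented for illustration only
toy_df = pd.DataFrame({"group": ["a", "b"], "value": [10, 20]})

# Write into a temporary directory so no project file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    no_index_path = os.path.join(tmp_dir, "no_index.csv")

    toy_df.to_csv(with_index_path)             # default: index written as first column
    toy_df.to_csv(no_index_path, index=False)  # index left out

    reread_with_index = pd.read_csv(with_index_path)
    reread_no_index = pd.read_csv(no_index_path)

print(list(reread_with_index.columns))
print(list(reread_no_index.columns))
```

Passing index=False is a reasonable choice when the index is just the default 0, 1, 2, ... numbering and carries no information of its own.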

We have now covered reading and writing CSV files with pandas. We will go into more detail in the Data Understanding step.

### Exercise: Read the HELPfull.csv

Now that we have learned how to read from and write to a CSV file, make a duplicate of the project and experiment with the **'data/HELPfull.csv'** file in this project.

Tasks you can try:
1.  Read the file in.
2.  Print the first 5 rows.
3.  Print the first 10 rows.
4.  Print the information of the dataset.
5.  Write the data to a file **'data/HELPfull_new.csv'**.

Collect Data From CSV by Di Wu is licensed under [CC BY NC SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).