<a href="https://colab.research.google.com/github/csaatechnicalarts/ML_Bootcamp/blob/main/Pandas_04_Plotting_FileIO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Pandas 04: Plotting and File I/O

##Introduction
Loading and storing data, and displaying graphs and plots, are crucial tasks in data analysis. Fortunately, Pandas can handle a variety of file formats, from CSV and tab-delimited text files to HTML, JSON and HDF5. In this tutorial we'll introduce some of the basics of handling CSV files.

##Read Data From a CSV File
To read a CSV data file into a DataFrame, call the *read_csv()* function with the path to the CSV file, along with the appropriate keyword arguments:
* delimiter - Specifies the character that separates the data fields. The comma **(,)** is the default. Other common delimiters include tabs **(\t)**, semicolons **(;)**, spaces **(` `)**, or even custom characters.
* header - Determines how column names are handled when reading the file data. By default **(header = 'infer')** Pandas assumes the first row in the file lists the column labels. If the CSV file doesn't have labels in the first row, pass **(header = None)** and Pandas will assign default numerical labels, starting from zero. With **(header = 0)** Pandas explicitly treats the first row as the list of column labels.
* index_col - Selects a column of the CSV file, by position or by label, to use as the index of the data frame (the row labels).
* skiprows - Skips rows in the CSV file. A single integer skips that many rows from the top of the file; a list of integers skips the rows at those zero-based positions.
* names - Supplies a list of column labels, and is used together with *header*. To discard the labels in the first row and substitute your own, pass **(header = 0, names = ['col_01', 'col_02', etc.])** to Pandas.
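
As a minimal sketch of the *delimiter* and *skiprows* parameters, the snippet below parses a small semicolon-separated text held in memory with `io.StringIO`, so it runs without any file on disk. The sample text and the names `raw_text` and `demo_df` are made up for illustration and are not part of the tutorial's data set.

```python
import io

import pandas as pd

# Hypothetical sample: one note line, then semicolon-separated data.
raw_text = (
    "exported for demonstration\n"
    "Name;Age;Country\n"
    "Jan Smuts;80;South Africa\n"
    "Ibn Saud;78;Saudi Arabia\n"
)

# skiprows=1 skips the first line of the text;
# delimiter=";" splits each remaining line on semicolons.
demo_df = pd.read_csv(io.StringIO(raw_text), delimiter=";", skiprows=1)
print(demo_df)
```

Because the note line is skipped, the second line of the text is the one Pandas uses for the column labels.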

Let's run some examples using the CSV file with information about World War 2 leaders ("ww2_leaders.csv").


Calling *read_csv()* with the default parameters instructs Pandas to read the CSV file and treat the first row as the column labels.

In [1]:
import pandas as pd

df = pd.read_csv("sample_data/ww2_leaders.csv")
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


,Name,Born,Died,Age,Title,Country
0,Franklin Roosevelt,1882-01-30,1945-04-12,63,President,United States
1,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
5,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister,United Kingdom
6,Manuel Camacho,1897-04-24,1955-10-13,58,President,Mexico
7,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister,South Africa
8,Ibn Saud,1875-01-15,1953-11-09,78,King,Saudi Arabia
9,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister,Thailand


Alternatively, we can tell Pandas not to treat the first row as headers. Pandas then supplies zero-based integer labels for the data frame columns, and the row of labels from the file becomes the first row of data.

In [3]:
df = pd.read_csv("sample_data/ww2_leaders.csv", header = None)
df

,0,1,2,3,4,5
0,Name,Born,Died,Age,Title,Country
1,Franklin Roosevelt,1882-01-30,1945-04-12,63,President,United States
2,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union
3,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
4,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
5,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
6,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister,United Kingdom
7,Manuel Camacho,1897-04-24,1955-10-13,58,President,Mexico
8,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister,South Africa
9,Ibn Saud,1875-01-15,1953-11-09,78,King,Saudi Arabia
10,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister,Thailand
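
One side effect of **(header = None)** on a file that does have a label row is worth seeing directly: the labels are read as data, so a column that would otherwise be numeric now contains text. The sketch below illustrates this on a tiny in-memory sample; `raw_text`, `with_header` and `no_header` are hypothetical names chosen for this example.

```python
import io

import pandas as pd

# Hypothetical two-record sample laid out like the tutorial's file.
raw_text = (
    "Name,Age\n"
    "Jan Smuts,80\n"
    "Ibn Saud,78\n"
)

with_header = pd.read_csv(io.StringIO(raw_text))
no_header = pd.read_csv(io.StringIO(raw_text), header=None)

# Default header handling: the Age column holds only numbers.
print(pd.api.types.is_numeric_dtype(with_header["Age"]))

# header=None: the text "Age" is now a value in column 1,
# so that column can no longer be numeric.
print(pd.api.types.is_numeric_dtype(no_header[1]))

# The label row counts as data, so there is one extra row.
print(len(with_header), len(no_header))
```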


We can also skip the row of column labels in the CSV file and substitute another list of labels.

In [5]:
df = pd.read_csv("sample_data/ww2_leaders.csv", header = 0, names = ['col_01', 'col_02', 'col_03', 'col_04', 'col_05', 'col_06'])
df

,col_01,col_02,col_03,col_04,col_05,col_06
0,Franklin Roosevelt,1882-01-30,1945-04-12,63,President,United States
1,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
5,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister,United Kingdom
6,Manuel Camacho,1897-04-24,1955-10-13,58,President,Mexico
7,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister,South Africa
8,Ibn Saud,1875-01-15,1953-11-09,78,King,Saudi Arabia
9,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister,Thailand
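
The **(header = 0)** argument matters here. When *names* is passed without it, Pandas behaves as if **(header = None)** were set and keeps the file's own label row as data. The following sketch compares the two calls on a small in-memory sample; the sample text and the labels `leader` and `age_at_death` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical sample with a label row followed by two records.
raw_text = "Name,Age\nJan Smuts,80\nIbn Saud,78\n"
labels = ["leader", "age_at_death"]

# header=0 marks the first row as the label row, and names replaces it.
replaced = pd.read_csv(io.StringIO(raw_text), header=0, names=labels)

# names alone: the file's label row is kept as the first data row.
kept = pd.read_csv(io.StringIO(raw_text), names=labels)

print(replaced)
print(kept)
```

Passing *names* alone is the right choice only when the file has no label row to begin with.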


Now let's use the last column, **'Country'**, as the row labels (the index) of the data frame.

In [7]:
df = pd.read_csv("sample_data/ww2_leaders.csv", index_col = 5)
df

Country,Name,Born,Died,Age,Title
United States,Franklin Roosevelt,1882-01-30,1945-04-12,63,President
Soviet Union,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader
Germany,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer
Japan,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor
France,Charles de Gaulle,1890-11-22,1970-11-09,79,President
United Kingdom,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister
Mexico,Manuel Camacho,1897-04-24,1955-10-13,58,President
South Africa,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister
Saudi Arabia,Ibn Saud,1875-01-15,1953-11-09,78,King
Thailand,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister
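
The *index_col* parameter also accepts a column label instead of a position, which can be easier to read. The sketch below, which uses a small in-memory sample rather than the tutorial's file, sets the index by label and then looks up a row with `.loc`; the name `by_country` is just an illustrative choice.

```python
import io

import pandas as pd

# Hypothetical sample using three of the tutorial's columns.
raw_text = (
    "Name,Title,Country\n"
    "Jan Smuts,Prime Minister,South Africa\n"
    "Ibn Saud,King,Saudi Arabia\n"
)

# Select the index column by its label rather than by its position.
by_country = pd.read_csv(io.StringIO(raw_text), index_col="Country")

# The index keeps the column's name, and rows can be looked up by label.
print(by_country.index.name)
print(by_country.loc["Saudi Arabia", "Name"])
```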
