In [4]:
import pandas as pd

# 6 Reading and writing files

Reading and writing files is an essential part of data science, because the
data usually lives in a file such as a CSV or an MS Excel workbook.

Here we’ll work with various file formats to read and write data.

## 6.1 Reading CSV

Reading a CSV file is straightforward, but we’ll discuss some practical
issues that come up when dealing with real-world data.

## 6.1.1 Reading

Let’s look at how to read CSV data:

In [5]:
df = pd.read_csv('names_ages.csv')
df

Unnamed: 0,Name,Age,dob,gender
0,Arun,21,20/02/97,m
1,Brun,42,20/02/92,m
2,Ram,32,20/02/94,m
3,Mohan,25,20/02/99,m
4,Sita,21,20/02/97,f
5,Rita,42,20/02/92,f
6,Gita,32,20/02/94,f
7,Arti,25,20/02/99,f


## 6.1.2 Removing header

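If the CSV has no header row, or we don’t want the first row treated as one,
we can pass header=None. pandas then numbers the columns 0, 1, 2 and so on,
and reads the first row as ordinary data.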

In [13]:
df = pd.read_csv("names_ages.csv", header=None)
df

Unnamed: 0,0,1,2,3
0,Name,Age,dob,gender
1,Arun,21,20/02/97,m
2,Brun,42,20/02/92,m
3,Ram,32,20/02/94,m
4,Mohan,25,20/02/99,m
5,Sita,21,20/02/97,f
6,Rita,42,20/02/92,f
7,Gita,32,20/02/94,f
8,Arti,25,20/02/99,f
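
To make the effect of header=None easier to see, here is a minimal sketch that reads the same text twice. The inline `csv_text` is a hypothetical stand-in for names_ages.csv (only two data rows), used so the example runs without the file.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for names_ages.csv (assumption: two rows only)
csv_text = "Name,Age,dob,gender\nArun,21,20/02/97,m\nBrun,42,20/02/92,m\n"

# Default: the first line is used as the header
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: columns are numbered and the first line becomes a data row
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(with_header.columns))
print(list(no_header.columns))
print(len(with_header), len(no_header))

# The header text now sits inside column 1, so it is no longer numeric
print(pd.api.types.is_numeric_dtype(with_header["Age"]))
print(pd.api.types.is_numeric_dtype(no_header[1]))
```

Note that the old header text ends up inside the data, so a column such as Age is no longer read as numbers. header=None is therefore most useful for files that truly have no header line.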


## 6.1.3 Adding custom header

Sometimes we want to give the columns our own names, or the data has no
header row at all. In either case we can supply our own header with the
names parameter:

In [15]:
df = pd.read_csv('names_ages.csv', names=['NAME', 'AGE', 'dateOfBirth', 'GENDER'])
df.head()

Unnamed: 0,NAME,AGE,dateOfBirth,GENDER
0,Name,Age,dob,gender
1,Arun,21,20/02/97,m
2,Brun,42,20/02/92,m
3,Ram,32,20/02/94,m
4,Mohan,25,20/02/99,m


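The column names now use a different case, and “dob” has become
“dateOfBirth”. Because names was passed without header=0, the original
header line is kept as the first data row (row 0 above); passing header=0
together with names replaces it instead.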
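
The following sketch contrasts the two behaviours: passing names alone, and passing names together with header=0. The inline `csv_text` and the variable names are assumptions made for illustration; they are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for names_ages.csv
csv_text = "Name,Age,dob,gender\nArun,21,20/02/97,m\nBrun,42,20/02/92,m\n"
new_names = ["NAME", "AGE", "dateOfBirth", "GENDER"]

# names only: the file's own header line is kept as the first data row
kept = pd.read_csv(io.StringIO(csv_text), names=new_names)

# names + header=0: the file's header line is replaced by the new names
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

print(kept)
print(replaced)
```

With header=0 the AGE column stays numeric, because the old header text no longer appears among the values.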

## 6.1.4 Reading specific rows

We can use head(), tail() and similar methods to access specific rows of a
DataFrame, but we can also limit the rows that are read from the CSV itself:


In [17]:
df = pd.read_csv('names_ages.csv',nrows=3)
df

Unnamed: 0,Name,Age,dob,gender
0,Arun,21,20/02/97,m
1,Brun,42,20/02/92,m
2,Ram,32,20/02/94,m


The value of ‘nrows’ is the number of data rows to read; the header row is
not counted.
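
As a small sketch of this, the code below reads the first three data rows and then, as an extension not shown in the original notebook, reads a window from the middle of the file by combining nrows with a list-like skiprows. The inline `csv_text` is a hypothetical copy of names_ages.csv.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of names_ages.csv
csv_text = (
    "Name,Age,dob,gender\n"
    "Arun,21,20/02/97,m\n"
    "Brun,42,20/02/92,m\n"
    "Ram,32,20/02/94,m\n"
    "Mohan,25,20/02/99,m\n"
    "Sita,21,20/02/97,f\n"
    "Rita,42,20/02/92,f\n"
    "Gita,32,20/02/94,f\n"
    "Arti,25,20/02/99,f\n"
)

# First three data rows; the header line does not count towards nrows
first_three = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Skip file lines 1-3 (the first three data rows) but keep line 0 (the header),
# then read the next two data rows
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4), nrows=2)

print(first_three)
print(window)
```

Reading only part of a file this way can be handy for a quick look at a large CSV before loading all of it.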


## 6.1.5 Reading the data from specific row

Sometimes a CSV contains some text before the actual data starts, like:

In [18]:
df = pd.read_csv('name-age.csv')
df.head()

Unnamed: 0,name age date,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,name,age,dob,gender
2,ram,32,97/10/4,f
3,arun,33,97/10/5,m
4,barnun,34,97/10/6,f


By default, the first line becomes the header and the blank line becomes the
first row, which we don’t want.
What we can do is:

df = pd.read_csv('name-age.csv', skiprows=2)

This skips the first two lines of the CSV, so the data starts from the third
line, which is then used as the header.

In [19]:
df = pd.read_csv('name-age.csv',skiprows=2)
df.head()

Unnamed: 0,name,age,dob,gender
0,ram,32,97/10/4,f
1,arun,33,97/10/5,m
2,barnun,34,97/10/6,f
3,mohan,35,97/10/7,m
4,santosh,36,97/10/8,f


An alternative is to specify the header row directly:

df = pd.read_csv('name-age.csv', header=2)

header=2 refers to the third line of the CSV, because indexing starts from 0.


In [20]:
df = pd.read_csv('name-age.csv',header=2)
df.head()

Unnamed: 0,name,age,dob,gender
0,ram,32,97/10/4,f
1,arun,33,97/10/5,m
2,barnun,34,97/10/6,f
3,mohan,35,97/10/7,m
4,santosh,36,97/10/8,f
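
The sketch below checks that the two approaches give the same result on a file shaped like the one above. The inline `csv_text` is an assumption about the layout of name-age.csv (a title line, a line containing only commas, then the real header), shortened to two data rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for name-age.csv: title line, filler line, real header
csv_text = (
    "name age date,,,\n"
    ",,,\n"
    "name,age,dob,gender\n"
    "ram,32,97/10/4,f\n"
    "arun,33,97/10/5,m\n"
)

# Option 1: drop the first two lines, then use the next line as the header
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# Option 2: state that the header is on line index 2 (the third line)
by_header = pd.read_csv(io.StringIO(csv_text), header=2)

print(skipped)
print(skipped.equals(by_header))
```

According to the pandas documentation, header ignores completely empty lines when skip_blank_lines=True (the default), so for a file whose filler line has no commas at all the right header value may be different from the raw line number.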


## 6.1.6 Cleaning NA data

Real data often has empty cells. pandas reads these as NaN (not a number),
which is easy to handle later. If the data uses its own convention for
missing values (such as “na” or “not available”), we have to convert those to
NaN as well so that we can clean them later.
This can be done by:

df = pd.read_csv('name-age.csv', na_values=['na', 'not available'])

In [27]:
df = pd.read_csv('name-age.csv',na_values=['na','not available'])
df

Unnamed: 0,name age date,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,name,age,dob,gender
2,ram,32,97/10/4,f
3,arun,33,97/10/5,m
4,barnun,34,97/10/6,
5,mohan,35,97/10/7,m
6,santosh,36,97/10/8,
7,sanjeep,37,97/10/9,m
8,dinesh,38,97/10/10,f
9,rita,39,97/10/11,m


Row 7 (Devi) already has NaN in the gender column because the cell was blank
in the file.
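
To show what na_values changes, this sketch reads the same text with and without it and counts the missing gender values. The rows in `csv_text` are hypothetical and were chosen to include a blank cell, an “na” and a “not available”.

```python
import io

import pandas as pd

# Hypothetical rows: one blank gender, one "na", one "not available"
csv_text = (
    "name,age,dob,gender\n"
    "ram,32,97/10/4,f\n"
    "barnun,34,97/10/6,na\n"
    "santosh,36,97/10/8,not available\n"
    "devi,37,97/10/9,\n"
)

# Without na_values only the blank cell is treated as missing
plain = pd.read_csv(io.StringIO(csv_text))

# With na_values the custom markers are converted to NaN as well
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["na", "not available"])

print(plain["gender"].isna().sum())
print(cleaned["gender"].isna().sum())
```

A list passed to na_values applies to every column, which is what makes it risky when a marker can also be a valid value.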

Now suppose someone’s name really is “na”. With the cleaning above, that name
would be turned into NaN even though it is a valid value.
To avoid this and apply the cleaning only to particular columns, we can pass
a dictionary that maps column names to their missing-value markers:

In [28]:
# header=2 so that the real column names (such as 'gender') can be used as keys
df = pd.read_csv('name-age.csv', header=2, na_values={'gender': ['na', 'not available']})
df

Unnamed: 0,name,age,dob,gender
0,ram,32,97/10/4,f
1,arun,33,97/10/5,m
2,barnun,34,97/10/6,
3,mohan,35,97/10/7,m
4,santosh,36,97/10/8,
5,sanjeep,37,97/10/9,m
6,dinesh,38,97/10/10,f
7,rita,39,97/10/11,m


The cleaning is now applied only to the listed column. Similarly, if the age
column uses an impossible value such as -1 to mark missing data, we can add
an entry for the age column so that this value is read as NaN.
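
A minimal sketch of per-column cleaning follows. The inline `csv_text` is hypothetical and includes a person whose name really is “na”, a -1 age marker, and another negative age. Since na_values matches specific listed values rather than conditions, the last step uses Series.mask as one possible way to blank out any remaining negative ages; that step is an addition for illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical rows: a real name "na", a -1 age marker, and another negative age
csv_text = (
    "name,age,dob,gender\n"
    "na,32,97/10/4,f\n"
    "arun,-1,97/10/5,na\n"
    "mohan,35,97/10/7,not available\n"
    "sita,-7,97/10/9,f\n"
)

# Each key is a column name; its markers apply to that column only
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"gender": ["na", "not available"], "age": [-1]},
)

print(df)  # the name "na" survives; gender markers and age -1 become NaN

# na_values only matches listed values, so handle "any negative age" after reading
df["age"] = df["age"].mask(df["age"] < 0)

print(df)
```

Listing markers per column keeps valid values in other columns intact, at the cost of having to know the column names before reading the file.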
