# Part 4

## Reading and writing text files

In [1]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

In [2]:
import os
os.getcwd()

'/home/nbuser/library'

In [4]:
os.chdir("data/") # same as R:setwd('')

In [5]:
os.listdir()

['nfl_frame.csv',
 'test3.csv',
 'dummydata2.csv',
 'test.csv',
 'dummydata1.csv',
 'dummydata11.csv',
 'tips.csv',
 'flights.csv',
 'test2.txt',
 'redwines.csv']

Files can be read with methods such as *read()* and *readlines()*, or by treating the file object as an iterable of lines and working with the lines directly.
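The three approaches can be compared in a minimal sketch. The file `sample.txt` and its contents are hypothetical, created in a temporary directory only so the example is self-contained.

```python
import os
import tempfile

# Hypothetical sample file, written here so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# read() returns the whole file as a single string.
with open(path) as f:
    whole_text = f.read()

# readlines() returns a list of lines, each keeping its trailing newline.
with open(path) as f:
    line_list = f.readlines()

# Iterating over the file object yields one line at a time,
# so a large file never has to be held in memory all at once.
with open(path) as f:
    stripped = [line.rstrip("\n") for line in f]

print(stripped)
```

Iteration is usually the better fit for large files, while *read()* is convenient when the whole content is needed as one string.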

However, in data science and analytics we most commonly work with rectangular data stored in a familiar format such as CSV.

In [6]:
dframe = pd.read_csv("dummydata1.csv",header = None) # R:read.csv

In [8]:
dframe

|   | 0 | 1 | 2 | 3 | 4      |
|---|---|---|---|---|--------|
| 0 | q | r | s | t | apple  |
| 1 | 2 | 3 | 4 | 5 | pear   |
| 2 | a | s | d | f | rabbit |
| 3 | 5 | 2 | 5 | 7 | dog    |


You don't have to change directory; instead, you can leave your working directory unchanged and pass the file's path when reading:

*e.g.* `pd.read_csv("D:/Training/Datasets/dummydata1.csv")`
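As a rough illustration of reading by path, the sketch below builds the path with `pathlib` and confirms that the working directory is untouched. The directory and the file `example.csv` are hypothetical stand-ins, created in a temporary location so the code runs anywhere.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data directory and file, created only for this sketch.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "example.csv"
csv_path.write_text("q,r,s,t,apple\n2,3,4,5,pear\n")

cwd_before = os.getcwd()

# Pass the full path to read_csv; no os.chdir() call is needed.
frame = pd.read_csv(csv_path, header=None)

cwd_after = os.getcwd()
print(frame.shape, cwd_before == cwd_after)
```

Building paths with `pathlib` or `os.path.join` also avoids hard-coding a separator that only works on one operating system.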

#### File modes

When you open a file with `open()`, it is opened in text read mode by default, but other modes are available. A file object's mode determines whether the file is treated as text or binary data, and whether we read from it, write to it, or both. The following table lists the mode characters available in Python 3.

| Character | Meaning                                                         |
|-----------|-----------------------------------------------------------------|
| r         | open for reading (default)                                      |
| w         | open for writing, truncating the file first                     |
| x         | open for exclusive creation, failing if the file already exists |
| a         | open for writing, appending to the end of the file if it exists |
| b         | binary mode                                                     |
| t         | text mode (default)                                             |
| +         | open a disk file for updating (reading and writing)             |
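The modes in the table can be tried out with the built-in `open()`. This is a small sketch only; the file name `modes.txt` is hypothetical and the file lives in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file name

# "w" creates the file (or truncates an existing one) before writing.
with open(path, "w") as f:
    f.write("one\n")

# "a" keeps the existing content and appends at the end.
with open(path, "a") as f:
    f.write("two\n")

# "x" refuses to overwrite: it raises FileExistsError if the file exists.
try:
    open(path, "x")
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

# "r" is the default (same as "rt") and returns decoded text.
with open(path) as f:
    text = f.read()

# Adding "b" returns raw bytes instead of text.
with open(path, "rb") as f:
    raw = f.read()

# "r+" opens an existing file for both reading and writing.
with open(path, "r+") as f:
    first_line = f.readline()
    f.seek(0, os.SEEK_END)  # move to the end before writing
    f.write("three\n")

with open(path) as f:
    final_lines = f.read().splitlines()

print(final_lines)
```

Mode characters combine, so `"rb"` means binary reading and `"r+"` means reading plus writing on an existing file.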

In [9]:
dframe # with header=None the columns are labelled 0-4; without it, the first row would be taken as headers

|   | 0 | 1 | 2 | 3 | 4      |
|---|---|---|---|---|--------|
| 0 | q | r | s | t | apple  |
| 1 | 2 | 3 | 4 | 5 | pear   |
| 2 | a | s | d | f | rabbit |
| 3 | 5 | 2 | 5 | 7 | dog    |


In [11]:
dframe = pd.read_csv("dummydata1.csv",header = None)
dframe

|   | 0 | 1 | 2 | 3 | 4      |
|---|---|---|---|---|--------|
| 0 | q | r | s | t | apple  |
| 1 | 2 | 3 | 4 | 5 | pear   |
| 2 | a | s | d | f | rabbit |
| 3 | 5 | 2 | 5 | 7 | dog    |


In [13]:
dframe = pd.read_table("dummydata11.csv",sep = '|') # generic delimited file input; R:read.table
dframe

|   | Col1 | Col2 | Col3 | Col4   |
|---|------|------|------|--------|
| q | r    | s    | t    | apple  |
| 2 | 3    | 4    | 5    | pear   |
| a | s    | d    | f    | rabbit |
| 5 | 2    | 5    | 7    | dog    |
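In the output above the first value of each row has become the row label rather than a column. This is what pandas does when the header row has one field fewer than the data rows, which appears to be the case for `dummydata11.csv`. The sketch below reproduces the effect with a hypothetical in-memory stand-in for the file, using `read_csv` with `sep="|"`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a pipe-delimited file:
# 4 names in the header row, 5 fields in every data row.
raw_text = "Col1|Col2|Col3|Col4\nq|r|s|t|apple\n2|3|4|5|pear\n"

frame = pd.read_csv(io.StringIO(raw_text), sep="|")

# The extra leading field of each row is used as the index.
print(list(frame.columns))
print(frame.loc["q", "Col4"])
```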


In [14]:
pd.read_csv("dummydata1.csv",header = None,nrows = 2) # read only a specific number of rows

|   | 0 | 1 | 2 | 3 | 4     |
|---|---|---|---|---|-------|
| 0 | q | r | s | t | apple |
| 1 | 2 | 3 | 4 | 5 | pear  |


In [15]:
dframe

|   | Col1 | Col2 | Col3 | Col4   |
|---|------|------|------|--------|
| q | r    | s    | t    | apple  |
| 2 | 3    | 4    | 5    | pear   |
| a | s    | d    | f    | rabbit |
| 5 | 2    | 5    | 7    | dog    |


In [16]:
dframe.to_csv('test.csv')

In [17]:
dframe.to_csv('test2.txt',sep="~")

In [18]:
dframe.columns

Index(['Col1', 'Col2', 'Col3', 'Col4'], dtype='object')

In [19]:
dframe.to_csv('test3.csv',columns=['Col1','Col2']) # write only specific columns
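One detail worth knowing about the `to_csv` calls above is that the index is written as the first column by default. The sketch below uses a small hypothetical frame, and calls `to_csv` without a path so that it returns the CSV text instead of writing a file.

```python
import io

import pandas as pd

# Hypothetical frame standing in for dframe.
frame = pd.DataFrame({"Col1": [1, 2], "Col2": [3, 4]}, index=["a", "b"])

# With no path, to_csv returns the CSV text; the index is written by default.
with_index = frame.to_csv()

# index=False leaves the index column out of the output.
without_index = frame.to_csv(index=False)

# sep and columns behave as in the cells above.
tilde_subset = frame.to_csv(sep="~", columns=["Col1"])

# index_col=0 turns the first column back into the index when reading.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

# Without index_col, the old index comes back as a column named "Unnamed: 0".
naive = pd.read_csv(io.StringIO(with_index))

print(with_index)
print(list(naive.columns))
```

If the index carries no information, writing with `index=False` keeps the file free of that extra unnamed column.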

In [21]:
tp = pd.read_csv('redwines.csv')

In [22]:
tp.tail()

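|      | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_so2 | total_so2 | density | pH   | sulphates | alcohol | quality |
|------|---------------|------------------|-------------|----------------|-----------|----------|-----------|---------|------|-----------|---------|---------|
| 1594 | 6.5           | 0.4              | 0.1         | 2.0            | 0.076     | 30.0     | 47.0      | 0.99554 | 3.36 | 0.48      | 9.4     | 6       |
| 1595 | 11.6          | 0.41             | 0.54        | 1.5            | 0.095     | 22.0     | 41.0      | 0.99735 | 3.02 | 0.76      | 9.9     | 7       |
| 1596 | 10.2          | 0.34             | 0.48        | 2.1            | 0.052     | 5.0      | 9.0       | 0.99458 | 3.2  | 0.69      | 12.1    | 7       |
| 1597 | 6.6           | 0.44             | 0.15        | 2.1            | 0.076     | 22.0     | 53.0      | 0.9957  | 3.32 | 0.62      | 9.3     | 5       |
| 1598 | 8.2           | 0.915            | 0.27        | 2.1            | 0.088     | 7.0      | 23.0      | 0.9962  | 3.26 | 0.47      | 10.0    | 4       |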


In [23]:
tp.shape

(1599, 12)

In [24]:
tp.columns

Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_so2', 'total_so2', 'density', 'pH', 'sulphates',
       'alcohol', 'quality'],
      dtype='object')

In [25]:
tp.describe()

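|       | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_so2  | total_so2 | density  | pH       | sulphates | alcohol   | quality  |
|-------|---------------|------------------|-------------|----------------|-----------|-----------|-----------|----------|----------|-----------|-----------|----------|
| count | 1599.0        | 1599.0           | 1599.0      | 1599.0         | 1599.0    | 1599.0    | 1599.0    | 1599.0   | 1599.0   | 1599.0    | 1599.0    | 1599.0   |
| mean  | 8.319637      | 0.527821         | 0.270976    | 2.538806       | 0.087467  | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149  | 10.422983 | 5.636023 |
| std   | 1.741096      | 0.17906          | 0.194801    | 1.409928       | 0.047065  | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507  | 1.065668  | 0.807569 |
| min   | 4.6           | 0.12             | 0.0         | 0.9            | 0.012     | 1.0       | 6.0       | 0.99007  | 2.74     | 0.33      | 8.4       | 3.0      |
| 25%   | 7.1           | 0.39             | 0.09        | 1.9            | 0.07      | 7.0       | 22.0      | 0.9956   | 3.21     | 0.55      | 9.5       | 5.0      |
| 50%   | 7.9           | 0.52             | 0.26        | 2.2            | 0.079     | 14.0      | 38.0      | 0.99675  | 3.31     | 0.62      | 10.2      | 6.0      |
| 75%   | 9.2           | 0.64             | 0.42        | 2.6            | 0.09      | 21.0      | 62.0      | 0.997835 | 3.4      | 0.73      | 11.1      | 6.0      |
| max   | 15.9          | 1.58             | 1.0         | 15.5           | 0.611     | 72.0      | 289.0     | 1.00369  | 4.01     | 2.0       | 14.9      | 8.0      |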


In [26]:
tp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed_acidity       1599 non-null float64
volatile_acidity    1599 non-null float64
citric_acid         1599 non-null float64
residual_sugar      1599 non-null float64
chlorides           1599 non-null float64
free_so2            1599 non-null float64
total_so2           1599 non-null float64
density             1599 non-null float64
pH                  1599 non-null float64
sulphates           1599 non-null float64
alcohol             1599 non-null float64
quality             1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [27]:
tp.quality.value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

In [28]:
tp.alcohol.describe()

count    1599.000000
mean       10.422983
std         1.065668
min         8.400000
25%         9.500000
50%        10.200000
75%        11.100000
max        14.900000
Name: alcohol, dtype: float64

## End of part 4