## Reading and Writing Files in Pandas
Pandas doc: https://pandas.pydata.org/pandas-docs/stable/reference/io.html<br/>
Tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Data source: https://databank.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG/1ff4a498/Popular-Indicators#

### Read **.csv** file

In [4]:
%pwd

'C:\\Users\\Payman\\Documents\\Python Scripts\\Session 6\\sample_data'

In [3]:
%cd ./sample_data

C:\Users\Payman\Documents\Python Scripts\Session 6\sample_data


In [6]:
import pandas as pd
# Read the CSV file into a DataFrame.
df = pd.read_csv('countries.csv')
df

Unnamed: 0,ID,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,CHN,China,1398.72,9596.96,12234.8,Asia,
1,IND,India,1351.16,3287.26,2575.67,Asia,8/15/1947
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,8/17/1945
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
5,PAK,Pakistan,205.71,881.91,302.14,Asia,8/14/1947
6,NGA,Nigeria,200.96,923.77,375.77,Africa,10/1/1960
7,BGD,Bangladesh,167.09,147.57,245.63,Asia,3/26/1971
8,RUS,Russia,146.79,17098.2,1530.75,,6/12/1992
9,MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


In [None]:
# More documentation on read_csv method
help(pd.read_csv)

`read_csv` accepts a variety of optional arguments. Let's try a couple.

In [9]:
import pandas as pd
# Skip the first line (the header row) and let pandas number the columns.
df = pd.read_csv('countries.csv', skiprows=1, header=None)
df

Unnamed: 0,0,1,2,3,4,5,6
0,CHN,China,1398.72,9596.96,12234.8,Asia,
1,IND,India,1351.16,3287.26,2575.67,Asia,8/15/1947
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,8/17/1945
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
5,PAK,Pakistan,205.71,881.91,302.14,Asia,8/14/1947
6,NGA,Nigeria,200.96,923.77,375.77,Africa,10/1/1960
7,BGD,Bangladesh,167.09,147.57,245.63,Asia,3/26/1971
8,RUS,Russia,146.79,17098.2,1530.75,,6/12/1992
9,MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


In [10]:
import pandas as pd
# The file is comma-separated and ',' is the default separator, so sep=',' is redundant here.
df = pd.read_csv('countries.csv', sep=',')
df.head()

Unnamed: 0,ID,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,CHN,China,1398.72,9596.96,12234.8,Asia,
1,IND,India,1351.16,3287.26,2575.67,Asia,8/15/1947
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,8/17/1945
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
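
Beyond `skiprows`, `header`, and `sep`, a few other `read_csv` options are often handy. The sketch below is only illustrative: it reads a small in-memory sample (a hypothetical three-row excerpt, not the real `countries.csv`) and assumes the first column of the file is an unnamed row index.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for countries.csv (assumed layout).
csv_text = """,ID,COUNTRY,POP
0,CHN,China,1398.72
1,IND,India,1351.16
2,USA,US,329.74
"""

# index_col=0 uses the first column as the row index instead of as data.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# usecols keeps only the named columns; nrows limits how many rows are read.
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=['ID', 'POP'], nrows=2)

print(df_indexed)
print(df_subset)
```

With a real file, the same arguments can be passed together with the file name instead of the `io.StringIO` object.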


### Read **.txt** file

In [11]:
import pandas as pd
# read_csv also reads comma-separated values stored in a .txt file; sep=',' is the default.
df = pd.read_csv('countries(coma).txt')
df.head()

Unnamed: 0,ID,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,CHN,China,1398.72,9596.96,12234.8,Asia,
1,IND,India,1351.16,3287.26,2575.67,Asia,8/15/1947
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,8/17/1945
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07


In [12]:
import pandas as pd
# Read the tab-separated file with the default separator: each row ends up in a single column.
df = pd.read_csv('countries(tab).txt')
df.head()

Unnamed: 0,ID\tCOUNTRY\tPOP\tAREA\tGDP\tCONT\tIND_DAY
0,CHN\tChina\t1398.72\t9596.96\t12234.8\tAsia\tNaN
1,IND\tIndia\t1351.16\t3287.26\t2575.67\tAsia\t8...
2,USA\tUS\t329.74\t9833.52\t19485.4\tN.America\t...
3,IDN\tIndonesia\t268.07\t1910.93\t1015.54\tAsia...
4,BRA\tBrazil\t210.32\t8515.77\t2055.51\tS.Ameri...


In [13]:
import pandas as pd
# Read the tab-separated file, setting the sep argument explicitly.
df = pd.read_csv('countries(tab).txt', sep='\t')
df.head()

Unnamed: 0,ID,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,CHN,China,1398.72,9596.96,12234.8,Asia,
1,IND,India,1351.16,3287.26,2575.67,Asia,8/15/1947
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,8/17/1945
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
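
The effect of the `sep` argument can be checked without any file on disk. This is a minimal sketch using a made-up two-row, tab-separated string (the variable names are illustrative, not part of the original notebook); it compares the column counts from the default separator and from `sep='\t'`.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample with three fields per line.
tsv_text = "ID\tCOUNTRY\tPOP\nCHN\tChina\t1398.72\nIND\tIndia\t1351.16\n"

# Default sep=',' finds no commas, so every line becomes one wide column.
df_wrong = pd.read_csv(io.StringIO(tsv_text))

# sep='\t' splits each line on tabs into the three expected columns.
df_right = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_wrong.shape)  # (2, 1)
print(df_right.shape)  # (2, 3)
```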


### Read **.xlsx** file

In [16]:
import pandas as pd
#read the Excel file 
df = pd.read_excel('countries.xlsx')  #read from an excel data source. This will read from the first sheet.
df.head()


Unnamed: 0,ID,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,CHN,China,1398.72,9596.96,12234.8,Asia,NaT
1,IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07


In [17]:
import pandas as pd
#read from an excel with a sheet name.
df = pd.read_excel('countries.xlsx', sheet_name = 'population')  
df.head()

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,2000 [YR2000],2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
0,"Population, total",SP.POP.TOTL,Afghanistan,AFG,20779953,21606988,22600770,23680871,24726684,25654277,26433049,27100536,27722276,28394813,29185507,30117413,31161376,32269589,33370794,34413603
1,"Population, total",SP.POP.TOTL,Albania,ALB,3089027,3060173,3051010,3039616,3026939,3011487,2992547,2970017,2947314,2927519,2913021,2905195,2900401,2895092,2889104,2880703
2,"Population, total",SP.POP.TOTL,Algeria,DZA,31042235,31451514,31855109,32264157,32692163,33149724,33641002,34166972,34730608,35333881,35977455,36661444,37383887,38140132,38923687,39728025
3,"Population, total",SP.POP.TOTL,American Samoa,ASM,57821,58494,59080,59504,59681,59562,59107,58365,57492,56683,56079,55759,55667,55713,55791,55812
4,"Population, total",SP.POP.TOTL,Andorra,AND,65390,67341,70049,73182,76244,78867,80993,82684,83862,84463,84449,83747,82427,80774,79213,78011
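
When a workbook has several sheets, passing `sheet_name=None` returns all of them at once as a dictionary keyed by sheet name. The sketch below assumes the `openpyxl` engine is installed (it does nothing if it is not) and builds a small throwaway workbook in a temporary folder; the file name `demo.xlsx` and the tiny tables are hypothetical stand-ins for `countries.xlsx`.

```python
import importlib.util
import os
import tempfile
import pandas as pd

# Small hypothetical tables standing in for the two sheets.
countries = pd.DataFrame({'ID': ['CHN', 'IND'], 'POP': [1398.72, 1351.16]})
population = pd.DataFrame({'Country Code': ['AFG', 'ALB'],
                           '2000 [YR2000]': [20779953, 3089027]})

sheets = None
# Assumption: openpyxl is the Excel engine; skip the demo if it is missing.
if importlib.util.find_spec('openpyxl') is not None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'demo.xlsx')
        # Write both tables into one workbook, one sheet each.
        with pd.ExcelWriter(path, engine='openpyxl') as writer:
            countries.to_excel(writer, sheet_name='countries', index=False)
            population.to_excel(writer, sheet_name='population', index=False)
        # sheet_name=None reads every sheet into a dict of DataFrames.
        sheets = pd.read_excel(path, sheet_name=None, engine='openpyxl')
    print(list(sheets))
```

A single sheet can then be picked out with `sheets['population']`.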


In [18]:
help(pd.read_excel)

Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name: 'str | int | list[IntStrT] | None' = 0, *, header: 'int | Sequence[int] | None' = 0, names: 'list[str] | None' = None, index_col: 'int | Sequence[int] | None' = None, usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None, squeeze: 'bool | None' = None, dtype: 'DtypeArg | None' = None, engine: "Literal['xlrd', 'openpyxl', 'odf', 'pyxlsb'] | None" = None, converters: 'dict[str, Callable] | dict[int, Callable] | None' = None, true_values: 'Iterable[Hashable] | None' = None, false_values: 'Iterable[Hashable] | None' = None, skiprows: 'Sequence[int] | int | Callable[[int], object] | None' = None, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, parse_dates: 'list | dict | bool' = False, date_parser: 'Callable | None' = None, thousands: 'str | None' = None, decimal: 'str' = '.', comment: 'st

### Read **.json** file

In [19]:
import pandas as pd
# Read the JSON file; pandas converts the JSON data into a DataFrame.
df = pd.read_json('countries.json')
df.head()

Unnamed: 0,ID,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,CHN,China,1398.72,9596.96,12234.8,Asia,
1,IND,India,1351.16,3287.26,2575.67,Asia,8/15/1947
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,8/17/1945
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
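
JSON data can be laid out in several ways, and the `orient` argument tells `read_json` which one to expect. The layout of `countries.json` is not shown here, so the following is only a sketch with a made-up string in the "records" layout (a list of one object per row).

```python
import io
import pandas as pd

# Hypothetical JSON text in the "records" layout: one object per row.
json_text = '[{"ID": "CHN", "POP": 1398.72}, {"ID": "IND", "POP": 1351.16}]'

# Wrap the text in StringIO so read_json treats it as a file-like object.
df_records = pd.read_json(io.StringIO(json_text), orient='records')

print(df_records)
```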


#### Write **.csv** file
The files below are written to the `outputs` subfolder of the current directory, which must already exist.

Write the DataFrame to a CSV file with a .csv extension.

In [20]:
df.to_csv('./outputs/df.csv')  # write the DataFrame to a CSV file with a .csv extension

In [None]:
help(pd.DataFrame.to_csv)

The default separator is a comma.

Write the DataFrame in CSV format to a file with a .txt extension.

In [21]:
df.to_csv('./outputs/df(coma).txt')

We can change the separator to '\t' or any other delimiter we'd like.

In [22]:
df.to_csv('./outputs/df(tab).txt', sep='\t')  # separate the fields with tabs instead of commas
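
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the difference, using a small made-up DataFrame (`df_demo` is a hypothetical name) and in-memory text instead of files:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'ID': ['CHN', 'IND'], 'POP': [1398.72, 1351.16]})

# With no path given, to_csv returns the CSV text instead of writing a file.
with_index = df_demo.to_csv()
without_index = df_demo.to_csv(index=False)  # leave the row index out

# Read both versions back to compare the resulting columns.
back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))

print(list(back_with.columns))     # ['Unnamed: 0', 'ID', 'POP']
print(list(back_without.columns))  # ['ID', 'POP']
```

Passing `index=False` is a common choice when the index carries no information of its own.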

#### Write JSON File

In [23]:
df.to_json('./outputs/df.json')

In [27]:
df.to_json('./outputs/df.json', indent=2)

In [None]:
help(df.to_json)

In [28]:
my_json = pd.read_json('./outputs/df.json')  # read the JSON file back into a DataFrame
my_json.head()  # the result is a DataFrame, not raw JSON text

Unnamed: 0,ID,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
0,CHN,China,1398.72,9596.96,12234.8,Asia,
1,IND,India,1351.16,3287.26,2575.67,Asia,8/15/1947
2,USA,US,329.74,9833.52,19485.4,N.America,1776-07-04
3,IDN,Indonesia,268.07,1910.93,1015.54,Asia,8/17/1945
4,BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
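
To look at the JSON itself rather than a DataFrame, one option is to work with the text directly. This sketch assumes a small made-up DataFrame (`df_demo` is a hypothetical name): calling `to_json` without a path returns the JSON string, and the standard `json` module parses it into plain Python objects.

```python
import json
import pandas as pd

df_demo = pd.DataFrame({'ID': ['CHN', 'IND'], 'POP': [1398.72, 1351.16]})

# With no path given, to_json returns the JSON text instead of writing a file.
json_text = df_demo.to_json(orient='records', indent=2)
print(json_text)

# json.loads turns the text into a list of dictionaries, one per row.
records = json.loads(json_text)
print(records[0]['ID'])
```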
