# Reading Data

You might have your data in .csv files or SQL tables. Maybe Excel files. Or .tsv files. Or something else. But the goal is the same in all cases. If you want to analyze that data using pandas, the first step is to read it into a pandas data structure, usually a DataFrame.

Pandas has many reader functions, such as:
1. read_csv: Read a comma-separated values (csv) file into a DataFrame.
2. read_json: Read a JSON file into a pandas object.
3. read_excel: Read an Excel file into a pandas DataFrame.
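The .tsv files mentioned above have no dedicated reader in this list. A minimal sketch, assuming tab-separated text, is to pass `sep="\t"` to `read_csv`. The in-memory sample below is a made-up excerpt shaped like the house data, used only so the snippet runs without a file.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample (three columns, two rows).
tsv_text = "Observation\tDist_Taxi\tParking\n1\t9796\tOpen\n2\t8294\tNot Provided\n"

# read_csv splits on the delimiter given by `sep`; a tab handles .tsv content.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(tsv_df)
```

For a file on disk, the `io.StringIO(...)` argument would simply be replaced by the file path.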

### Reading a CSV file

In [2]:
import pandas as pd

In [3]:
pd.read_csv('./HouseData.csv') 

Unnamed: 0,Observation,Dist_Taxi,Dist_Market,Dist_Hospital,Carpet,Builtup,Parking,City_Category,Rainfall,price
0,1,9796,5250,10703,1659.0,1961,Open,CAT B,530,664900.0
1,2,8294,8186,12694,1461.0,1752,Not Provided,CAT B,210,398200.0
2,3,11001,14399,16991,1340.0,1609,Not Provided,CAT A,720,540100.0
3,4,8301,11188,12289,1451.0,1748,Covered,CAT B,620,537300.0
4,5,10510,12629,13921,1770.0,2111,Not Provided,CAT B,450,466200.0
...,...,...,...,...,...,...,...,...,...,...
715,925,9615,7904,12521,1451.0,1734,Open,CAT C,670,348800.0
716,926,7176,5779,12382,1539.0,1829,Open,CAT B,650,465800.0
717,927,10915,17486,15964,1549.0,1851,Not Provided,CAT C,1220,706200.0
718,928,12176,8518,15673,1582.0,1910,Covered,CAT C,1080,663900.0


If you want to load only selected columns, you can pass the **usecols** parameter

In [4]:
pd.read_csv('./HouseData.csv' , usecols=["Observation" , "Dist_Taxi"]) 

Unnamed: 0,Observation,Dist_Taxi
0,1,9796
1,2,8294
2,3,11001
3,4,8301
4,5,10510
...,...,...
715,925,9615
716,926,7176
717,927,10915
718,928,12176


You can use the **na_values** parameter to specify extra values that should be treated as NA

In [5]:
pd.read_csv('./HouseData.csv' , usecols=["Observation" , "Dist_Taxi"] , na_values=[2]) 

Unnamed: 0,Observation,Dist_Taxi
0,1.0,9796
1,,8294
2,3.0,11001
3,4.0,8301
4,5.0,10510
...,...,...
715,925.0,9615
716,926.0,7176
717,927.0,10915
718,928.0,12176


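As you can see above, the value 2 in the Observation column has been replaced with NaN (shown here as an empty cell), which signifies an NA value. Because NaN is a floating-point value, the rest of the column is now displayed as floats (1.0, 3.0, ...). Note that a list passed to na_values applies to every column that is read, not just one.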
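Since a list of NA values is applied to all columns, it can be useful to restrict a sentinel to one column. A minimal sketch, assuming the dict form of `na_values` (column name mapped to its NA values); the third row, with a `Dist_Taxi` of 2, is invented purely to show the difference.

```python
import io
import pandas as pd

# Small in-memory sample; the last row is hypothetical.
csv_text = "Observation,Dist_Taxi\n1,9796\n2,8294\n3,2\n"

# A list applies to every column: the 2 in Observation and the 2 in Dist_Taxi both become NaN.
all_cols = pd.read_csv(io.StringIO(csv_text), na_values=[2])

# A dict limits the NA value to the named column only.
one_col = pd.read_csv(io.StringIO(csv_text), na_values={"Observation": [2]})

# Count missing values per column for each variant.
print(all_cols.isna().sum().to_dict())
print(one_col.isna().sum().to_dict())
```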

Use the **nrows** parameter to read a limited number of rows instead of the whole dataset. This is especially useful when reading a large file into a pandas dataframe.

In [6]:
pd.read_csv('./HouseData.csv' , usecols=["Observation" , "Dist_Taxi"] , nrows=10) 

Unnamed: 0,Observation,Dist_Taxi
0,1,9796
1,2,8294
2,3,11001
3,4,8301
4,5,10510
5,6,6665
6,7,13153
7,8,5882
8,9,7495
9,10,8233
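When the whole file is needed but does not fit comfortably in memory, a related option is the `chunksize` parameter, which makes `read_csv` return an iterator of smaller DataFrames. The sketch below is only an illustration: it uses a five-row in-memory sample in place of a large file, and the running totals are arbitrary examples of per-chunk work.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file (first five rows of the sample above).
csv_text = "Observation,Dist_Taxi\n1,9796\n2,8294\n3,11001\n4,8301\n5,10510\n"

total_rows = 0
taxi_sum = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    taxi_sum += int(chunk["Dist_Taxi"].sum())

print(total_rows, taxi_sum)
```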


When working with large files, you might want to specify the data type of each column to use your RAM efficiently. You can do so using the **dtype** parameter

* Let's check the default memory usage of the data frame
* For that we can use the *.memory_usage* method of the dataframe

In [9]:
df = pd.read_csv('./HouseData.csv')
df.memory_usage(index=True).sum()

57728

* The memory usage is around 57 KB

In [8]:
dtypes= {
    "Observation": "int16",
    "Dist_Taxi": "int16",
    "Dist_Market": "int16",
    "Dist_Hospital": "int16",
    "Carpet": "float16",
    "Builtup": "int16",
    "Parking": "category",
    "City_Category": "category",
    "Rainfall": "int16",
    "price": "float16"
}
df = pd.read_csv('./HouseData.csv' , dtype=dtypes)
df.memory_usage(index=True).sum()

13384

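* After specifying suitable dtypes, the memory footprint drops from 57728 to 13384 bytes, a reduction of about 4.3 times
* Caution: a narrow dtype must still cover the column's value range. float16 cannot represent values above 65504, so prices such as 664900.0 overflow to inf when read as float16; float32 is the safer choice for the price column. Likewise, int16 only holds values up to 32767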
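Before choosing a narrow dtype, it helps to confirm that the column's values fit into it. The following is a minimal sketch of such a range check; `fits` is a hypothetical helper written for this example, and the sample values are copied from the first rows shown earlier. It checks range only, not the loss of precision that small float types can also cause.

```python
import numpy as np
import pandas as pd

# Sample values taken from the first three rows of the house data.
df = pd.DataFrame({
    "Dist_Taxi": [9796, 8294, 11001],
    "price": [664900.0, 398200.0, 540100.0],
})

def fits(series, dtype):
    """Return True if every value in `series` lies within the range of `dtype`."""
    if np.issubdtype(np.dtype(dtype), np.integer):
        info = np.iinfo(dtype)   # integer limits
    else:
        info = np.finfo(dtype)   # floating-point limits
    return bool(series.min() >= info.min and series.max() <= info.max)

print(fits(df["Dist_Taxi"], "int16"))
print(fits(df["price"], "float16"))
print(fits(df["price"], "float32"))
```

Running a check like this on each column before building the `dtype` dictionary avoids silent overflow.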

### Reading a JSON file

* The *House Data* is available in JSON format as well.

* A JSON file can be laid out in different formats
    - The different formats are described in the [Official Docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)

* What we have is the *records* format: a JSON array of 'rows', where each row is a dictionary with column names as keys and column values as values

```json
[{
  "Observation": 1,
  "Dist_Taxi": 9796,
  "Dist_Market": 5250,
  "Dist_Hospital": 10703,
  "Carpet": 1659.0,
  "Builtup": 1961,
  "Parking": "Open",
  "City_Category": "CAT B",
  "Rainfall": 530,
  "price": 664900.0
}, {
....
]
```

* To read the file we can simply use the *read_json* function provided by pandas

In [11]:
pd.read_json("./HouseData.json")

Unnamed: 0,Observation,Dist_Taxi,Dist_Market,Dist_Hospital,Carpet,Builtup,Parking,City_Category,Rainfall,price
0,1,9796,5250,10703,1659.0,1961,Open,CAT B,530,664900
1,2,8294,8186,12694,1461.0,1752,Not Provided,CAT B,210,398200
2,3,11001,14399,16991,1340.0,1609,Not Provided,CAT A,720,540100
3,4,8301,11188,12289,1451.0,1748,Covered,CAT B,620,537300
4,5,10510,12629,13921,1770.0,2111,Not Provided,CAT B,450,466200
...,...,...,...,...,...,...,...,...,...,...
715,925,9615,7904,12521,1451.0,1734,Open,CAT C,670,348800
716,926,7176,5779,12382,1539.0,1829,Open,CAT B,650,465800
717,927,10915,17486,15964,1549.0,1851,Not Provided,CAT C,1220,706200
718,928,12176,8518,15673,1582.0,1910,Covered,CAT C,1080,663900
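To make the *records* layout concrete, here is a small sketch that parses two records from an in-memory string rather than from `HouseData.json`. Passing `orient="records"` states the expected layout explicitly; the reduced column set is an assumption made to keep the example short.

```python
import io
import pandas as pd

# Two records in the layout described above (values from the first rows of the house data).
json_text = """[
  {"Observation": 1, "Dist_Taxi": 9796, "Parking": "Open", "price": 664900.0},
  {"Observation": 2, "Dist_Taxi": 8294, "Parking": "Not Provided", "price": 398200.0}
]"""

# Each dictionary becomes one row; the keys become the column names.
json_df = pd.read_json(io.StringIO(json_text), orient="records")
print(json_df.columns.tolist())
print(len(json_df))
```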


### Reading Excel File

* Pandas also supports reading Excel files, but requires an extra dependency to be installed: *xlrd* for legacy .xls files, while recent pandas versions read .xlsx files with *openpyxl* (xlrd 2.0 and later no longer reads .xlsx)
* Similar to what we have seen above, we can use the *read_excel* function
* We can conveniently specify the sheet name(s) to read using the *sheet_name* parameter

    - *sheet_name* can be a string, an integer, a list, or None
    - Strings are used for sheet names. Integers are used for zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets.

* If a list (or None) is provided as input, the output is a dictionary of dataframes, keyed by the requested sheet names or positions

[Official Docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)

We can install the dependency, if it is not already installed, using the *!* prefix, which runs a command-line command from within Jupyter

In [4]:
!python -m pip install xlrd



* Here *CustomSheetName* is the sheet name given in Excel

In [5]:
pd.read_excel("./HouseData.xlsx", sheet_name = "CustomSheetName")

Unnamed: 0,Observation,Dist_Taxi,Dist_Market,Dist_Hospital,Carpet,Builtup,Parking,City_Category,Rainfall,price
0,1,9796,5250,10703,1659.0,1961,Open,CAT B,530,664900
1,2,8294,8186,12694,1461.0,1752,Not Provided,CAT B,210,398200
2,3,11001,14399,16991,1340.0,1609,Not Provided,CAT A,720,540100
3,4,8301,11188,12289,1451.0,1748,Covered,CAT B,620,537300
4,5,10510,12629,13921,1770.0,2111,Not Provided,CAT B,450,466200
...,...,...,...,...,...,...,...,...,...,...
715,925,9615,7904,12521,1451.0,1734,Open,CAT C,670,348800
716,926,7176,5779,12382,1539.0,1829,Open,CAT B,650,465800
717,927,10915,17486,15964,1549.0,1851,Not Provided,CAT C,1220,706200
718,928,12176,8518,15673,1582.0,1910,Covered,CAT C,1080,663900


In [11]:
df_dict = pd.read_excel("./HouseData.xlsx", sheet_name = [0, "CustomSheetName"])


In [12]:
df_dict[0].head(2)

Unnamed: 0,Observation,Dist_Taxi,Dist_Market,Dist_Hospital,Carpet,Builtup,Parking,City_Category,Rainfall,price
0,1,9796,5250,10703,1659.0,1961,Open,CAT B,530,664900
1,2,8294,8186,12694,1461.0,1752,Not Provided,CAT B,210,398200


In [13]:
df_dict["CustomSheetName"].head(2)

Unnamed: 0,Observation,Dist_Taxi,Dist_Market,Dist_Hospital,Carpet,Builtup,Parking,City_Category,Rainfall,price
0,1,9796,5250,10703,1659.0,1961,Open,CAT B,530,664900
1,2,8294,8186,12694,1461.0,1752,Not Provided,CAT B,210,398200


### Sources

* [Pydata - Read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
* [Pydata - Read Json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
* [Pydata - Read Excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
* [Xlrd](https://xlrd.readthedocs.io/en/latest/)