# Reading CSV files

In [1]:
# import the pandas library
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

print(pd.__version__)

2.1.1


## Reading CSV data

You can read a CSV file into a DataFrame with the pandas **`read_csv()`** function.

In [2]:
# Read the dataset
data = pd.read_csv('datasets/big_mart_sales.csv')
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [3]:
# Print the type of data
print(type(data), data.shape)

<class 'pandas.core.frame.DataFrame'> (8523, 12)


## Different challenges with CSV files

### Read the data except the first few rows in the file

- You may get an error while reading a CSV file because someone has added a few comment lines at the top of the file.
- If you try to read such a file with pandas, you will get a **ParserError**, but you can still read the dataset by skipping those rows.
- To deal with the ParserError:
    - Open the CSV file in a text editor and check whether there are comments at the top.
    - If so, count the number of rows to skip.
    - While reading the file, pass the parameter **`skiprows=n`** (where `n` is the number of rows to skip).
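
The steps above can be tried without any file on disk. The following is a minimal sketch that assumes a small made-up CSV held in a string (the comment lines, the three columns, and the names `raw`, `n_skip`, and `demo` are illustrative, not part of the Big Mart files):

```python
import io

import pandas as pd

# Hypothetical file contents: three free-text comment lines, then real CSV data.
raw = (
    "Big Mart sales export\n"
    "Prepared for the demo\n"
    "Do not edit by hand\n"
    "Item_Identifier,Item_Type,Item_MRP\n"
    "FDA15,Dairy,249.8092\n"
    "DRC01,Soft Drinks,48.2692\n"
)

# A plain read takes the first comment line as the header and then
# trips over the wider rows further down.
try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as exc:
    print("ParserError:", exc)

# Number of lines that come before the real header row.
n_skip = 3

# Skip exactly that many lines; the next line becomes the header.
demo = pd.read_csv(io.StringIO(raw), skiprows=n_skip)
print(demo)
```

Here `skiprows` is given as an integer, so it counts lines from the top of the file; the first line that is not skipped is used as the header.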

In [4]:
# This raises an error because the first few rows don't contain CSV data
data1 = pd.read_csv('datasets/big_mart_sales_top_rows.csv')

ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2


In [5]:
# read the dataset, skipping the first 5 rows
data1 = pd.read_csv('datasets/big_mart_sales_top_rows.csv', skiprows=5)
data1.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


### Reading data from multiple directories

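When the data is split across several sub-folders, use the **`glob`** module to list the files in each folder, read them one by one, and then concatenate the resulting data frames.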

In [9]:
# import the library
import glob

# list to store data frames
merged_data = []

# list all the files in the folder
for i in glob.glob('datasets/multi-directory/*'):
    print(i)

    # List all the files present in the sub-folder
    for file in glob.glob(i + '/*'):
        print(file)
        # Add to the list
        merged_data.append(pd.read_csv(file))

datasets/multi-directory\1985
datasets/multi-directory\1985\1985.csv
datasets/multi-directory\1987
datasets/multi-directory\1987\1987.csv
datasets/multi-directory\1997
datasets/multi-directory\1997\1997.csv
datasets/multi-directory\1998
datasets/multi-directory\1998\1998.csv
datasets/multi-directory\1999
datasets/multi-directory\1999\1999.csv
datasets/multi-directory\2002
datasets/multi-directory\2002\2002.csv
datasets/multi-directory\2004
datasets/multi-directory\2004\2004.csv
datasets/multi-directory\2007
datasets/multi-directory\2007\2007.csv
datasets/multi-directory\2009
datasets/multi-directory\2009\2009.csv


In [10]:
# concatenate the dataframes
final_data = pd.concat(merged_data)

In [11]:
final_data.head()

Unnamed: 0.1,Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,7,FDP10,,Low Fat,0.12747,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
1,18,DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.668
2,21,FDW12,,Regular,0.0354,Baking Goods,144.5444,OUT027,1985,Medium,Tier 3,Supermarket Type3,4064.0432
3,23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
4,29,FDC14,,Regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362


In [12]:
# Store the merged data with the help of to_csv() function
final_data.to_csv('datasets/merged_big_mart_sales.csv', index = False)
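
As a possible variation on the nested loops above, `glob.glob` also accepts a recursive `**` pattern, which finds the files in every sub-folder in one call. The sketch below is self-contained and assumes a throwaway folder tree created with `tempfile`; the two year folders and their single-row contents are made up for illustration:

```python
import glob
import os
import tempfile

import pandas as pd

# Build a temporary folder tree that mimics the one-folder-per-year layout.
with tempfile.TemporaryDirectory() as root:
    for year, sales in [("1985", 4022.7636), ("1987", 994.7052)]:
        os.makedirs(os.path.join(root, year))
        pd.DataFrame(
            {"Outlet_Establishment_Year": [int(year)], "Item_Outlet_Sales": [sales]}
        ).to_csv(os.path.join(root, year, year + ".csv"), index=False)

    # '**' together with recursive=True matches CSV files at any depth below root.
    pattern = os.path.join(root, "**", "*.csv")
    csv_files = sorted(glob.glob(pattern, recursive=True))

    # Read every file while the temporary folder still exists.
    frames = [pd.read_csv(path) for path in csv_files]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Sorting the paths makes the row order reproducible, since `glob` does not guarantee any particular order. Building paths with `os.path.join` keeps the pattern portable between Windows and other systems.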

### Read a CSV with a specific delimiter

By default, pandas treats **`,`** as the separator when reading a CSV file. If the file uses another separator (delimiter), such as **`;`** or **`\t`** (tab), you need to specify it.

In [13]:
# read the data
data2 = pd.read_csv('datasets/big_mart_sales_delimiter.csv')
data2.head()

Unnamed: 0,Item_Identifier\tItem_Weight\tItem_Fat_Content\tItem_Visibility\tItem_Type\tItem_MRP\tOutlet_Identifier\tOutlet_Establishment_Year\tOutlet_Size\tOutlet_Location_Type\tOutlet_Type\tItem_Outlet_Sales
0,FDA15\t9.3\tLow Fat\t0.016047301\tDairy\t249.8...
1,DRC01\t5.92\tRegular\t0.019278216\tSoft Drinks...
2,FDN15\t17.5\tLow Fat\t0.016760075\tMeat\t141.6...
3,FDX07\t19.2\tRegular\t0.0\tFruits and Vegetabl...
4,NCD19\t8.93\tLow Fat\t0.0\tHousehold\t53.8614\...


In [14]:
# read the file again with delimiter parameter
data2 = pd.read_csv('datasets/big_mart_sales_delimiter.csv', delimiter='\t')
data2.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
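
The same approach works for other separators. The sketch below assumes two tiny made-up strings, one semicolon-separated and one tab-separated, and reads both with `sep`, of which `delimiter` is an alias in `read_csv`:

```python
import io

import pandas as pd

# Hypothetical contents: the same single record stored with two different separators.
semicolon_text = "Item_Identifier;Item_Type;Item_MRP\nFDA15;Dairy;249.8092\n"
tab_text = "Item_Identifier\tItem_Type\tItem_MRP\nFDA15\tDairy\t249.8092\n"

# Tell pandas which character splits the fields in each case.
by_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
by_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Once the separator is right, both reads produce the same DataFrame.
print(by_semicolon.equals(by_tab))
```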


### Read the first N rows of the data

In [15]:
# Specify the number of rows to read 
data3 = pd.read_csv('datasets/big_mart_sales.csv', nrows = 100)
data3.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [16]:
# shape of the data
print('Shape:', data3.shape)

Shape: (100, 12)
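
To see the effect of `nrows` on a smaller scale, here is a minimal sketch that assumes a generated ten-row CSV held in a string (the `ITEM000`-style identifiers and prices are made up):

```python
import io

import pandas as pd

# Hypothetical data: one header line followed by ten data rows.
rows = ["Item_Identifier,Item_MRP"] + [f"ITEM{i:03d},{100 + i}" for i in range(10)]
raw = "\n".join(rows) + "\n"

# nrows counts data rows only; the header line is not included in the count.
preview = pd.read_csv(io.StringIO(raw), nrows=3)
print(preview.shape)
print(preview)
```

Reading only a few rows like this is a quick way to check column names and types before loading a large file in full.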


### Read specific columns

If the dataset has a large number of columns, loading and inspecting all of them at once is impractical, so you can read only the columns you need by passing their names to the **`usecols`** parameter.

In [17]:
# read specific columns
column_data = pd.read_csv('datasets/big_mart_sales.csv', usecols = ['Item_Identifier', 'Item_Type',
                                                                    'Item_MRP', 'Item_Outlet_Sales'])
column_data.head()

Unnamed: 0,Item_Identifier,Item_Type,Item_MRP,Item_Outlet_Sales
0,FDA15,Dairy,249.8092,3735.138
1,DRC01,Soft Drinks,48.2692,443.4228
2,FDN15,Meat,141.618,2097.27
3,FDX07,Fruits and Vegetables,182.095,732.38
4,NCD19,Household,53.8614,994.7052
