<a href="https://colab.research.google.com/github/dishankkalra23/Playing-with-csv-and-pandas/blob/main/Playing_with_csv_and_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Libraries

In [1]:
import pandas as pd

## Setting up Kaggle API

In [2]:
 ! pip install -q kaggle

## Uploading the kaggle.json file, which contains the API token

In [None]:
from google.colab import files
files.upload()

In [None]:
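# Copy the uploaded token to ~/.kaggle/kaggle.json, where the Kaggle CLI looks for it.
# -p keeps mkdir from failing if the folder already exists (e.g. when the cell is re-run).
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/

# Restrict the file to owner read/write; the Kaggle CLI warns if other users can read the token.
! chmod 600 ~/.kaggle/kaggle.json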
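
The same three steps (create the folder, copy the token, restrict its permissions) can also be sketched in plain Python with the standard library. The function name `install_kaggle_token` and its parameters are hypothetical and chosen only for illustration; the `~/.kaggle/kaggle.json` location and the `600` permission are taken from the cell above.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir=None):
    """Copy the API token into dest_dir and make it readable by the owner only.

    Hypothetical helper: dest_dir defaults to ~/.kaggle, as in the shell cell.
    """
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    dest = dest_dir / "kaggle.json"
    shutil.copy(src, dest)                       # same effect as cp
    os.chmod(dest, 0o600)                        # same effect as chmod 600
    return dest
```

Calling `install_kaggle_token()` after `files.upload()` should leave the token in the same place as the shell commands do.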

## Copying the API command from the dataset's Kaggle page

In [5]:
! kaggle datasets download -d spscientist/students-performance-in-exams

Downloading students-performance-in-exams.zip to /content
  0% 0.00/8.70k [00:00<?, ?B/s]
100% 8.70k/8.70k [00:00<00:00, 18.4MB/s]


## Unzipping dataset file and removing zip file

In [6]:
!unzip \*.zip

Archive:  students-performance-in-exams.zip
  inflating: StudentsPerformance.csv  


In [7]:
! rm *.zip

# Loading dataset 

## Reading CSV file

In [14]:
dataset = '/content/StudentsPerformance.csv'
data = pd.read_csv(dataset)
data

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


### CSV stands for comma-separated values, but the fields can be separated by other characters too, such as tabs, whitespace, or colons. If your file is separated by colons, for example, you can still use read_csv() with the sep parameter.

In [15]:
# Note: this file is separated by commas, so sep=':' finds no separators in it.
# Because there are no colons, nothing is split and each line is read into a single column.
data_1 = pd.read_csv(dataset, sep=':')
data_1

Unnamed: 0,"gender,""race/ethnicity"",""parental level of education"",""lunch"",""test preparation course"",""math score"",""reading score"",""writing score"""
0,"female,""group B"",""bachelor's degree"",""standard..."
1,"female,""group C"",""some college"",""standard"",""co..."
2,"female,""group B"",""master's degree"",""standard"",..."
3,"male,""group A"",""associate's degree"",""free/redu..."
4,"male,""group C"",""some college"",""standard"",""none..."
...,...
995,"female,""group E"",""master's degree"",""standard"",..."
996,"male,""group C"",""high school"",""free/reduced"",""n..."
997,"female,""group C"",""high school"",""free/reduced"",..."
998,"female,""group D"",""some college"",""standard"",""co..."
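
To see `sep` working as intended, here is a minimal sketch on a file that really is colon-separated. The sample text and the names `colon_text` and `colon_df` are made up for illustration and are not part of the students dataset.

```python
import io

import pandas as pd

# Hypothetical colon-separated sample, held in memory instead of on disk
colon_text = "name:math:reading\nA:72:74\nB:69:90\n"

# sep=':' makes read_csv split each line on colons instead of commas
colon_df = pd.read_csv(io.StringIO(colon_text), sep=":")

print(colon_df.columns.tolist())  # ['name', 'math', 'reading']
print(colon_df.shape)             # (2, 3)
```

The same call should work for tab-separated files with `sep='\t'`.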


### We can specify which line of the file is the header, that is, the line that holds the column labels. It's usually the first line, but sometimes we'll want a later line if there is extra metadata at the top of the file. We can do that with the header parameter.

In [18]:
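df = pd.read_csv(dataset, header=1)
# header=1 uses the second line of the file (index 1) as the header and drops everything above it.
# Here that line is the first data row, so its values become the column labels
# (pandas renames the duplicate label 72 to 72.1).
# When names is not passed, read_csv behaves like header=0 and takes the first line for column labels.
df.head()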

Unnamed: 0,female,group B,bachelor's degree,standard,none,72,72.1,74
0,female,group C,some college,standard,completed,69,90,88
1,female,group B,master's degree,standard,none,90,95,93
2,male,group A,associate's degree,free/reduced,none,47,57,44
3,male,group C,some college,standard,none,76,78,75
4,female,group B,associate's degree,standard,none,71,83,78
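
The students file has no metadata above its header, so the call above throws away real labels. As a minimal sketch of the case `header=1` is meant for, the made-up sample below has one descriptive line before the real header. The text and the names `meta_text` and `meta_df` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical file: line 0 is a free-text note, line 1 holds the real column labels
meta_text = (
    "exported from a hypothetical grading tool\n"
    "gender,math score\n"
    "female,72\n"
    "male,47\n"
)

# header=1 takes line 1 as the column labels and discards line 0
meta_df = pd.read_csv(io.StringIO(meta_text), header=1)

print(meta_df.columns.tolist())  # ['gender', 'math score']
print(len(meta_df))              # 2
```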


### If column labels are not included in your file, you can use header=None to prevent your first line of data from being misinterpreted as column labels.

In [20]:
df = pd.read_csv(dataset, header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
1,female,group B,bachelor's degree,standard,none,72,72,74
2,female,group C,some college,standard,completed,69,90,88
3,female,group B,master's degree,standard,none,90,95,93
4,male,group A,associate's degree,free/reduced,none,47,57,44
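
The students file does have a header, which is why it shows up as row 0 above. A minimal sketch on a made-up file without any header line shows the difference `header=None` makes. The sample text and the variable names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical file that contains only data, no header line
no_header_text = "female,72,74\nmale,47,44\n"

# Default behaviour: the first data line is consumed as column labels
default_df = pd.read_csv(io.StringIO(no_header_text))

# header=None: every line is data and columns get integer labels 0, 1, 2
no_header_df = pd.read_csv(io.StringIO(no_header_text), header=None)

print(len(default_df))             # 1
print(len(no_header_df))           # 2
print(list(no_header_df.columns))  # [0, 1, 2]
```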


In [22]:
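labels = ['gen', 'ethn', 'LOE', 'lun', 'test', 'math_Score', 'reading_Score', 'writing_Score']
# names= supplies our own column labels. Without header=, pandas then treats every line as data,
# so the file's original header line shows up as row 0.
df = pd.read_csv(dataset, names=labels)
df.head()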

Unnamed: 0,gen,ethn,LOE,lun,test,math_Score,reading_Score,writing_Score
0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
1,female,group B,bachelor's degree,standard,none,72,72,74
2,female,group C,some college,standard,completed,69,90,88
3,female,group B,master's degree,standard,none,90,95,93
4,male,group A,associate's degree,free/reduced,none,47,57,44


### If the file already has a header line that you want to replace with your own labels, pass header=0 together with names so that pandas discards the original line

In [34]:
labels = ['gen', 'ethn', 'LOE', 'lun', 'test', 'math_Score', 'reading_Score', 'writing_Score']
# Note: header=0 marks line 0 as the existing header, which is then replaced by the given labels.
df = pd.read_csv(dataset, names=labels, header=0)
df.head()

Unnamed: 0,gen,ethn,LOE,lun,test,math_Score,reading_Score,writing_Score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
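
The two ways of passing `names` can be compared side by side in a small sketch. The two-column sample and the names `sample_text`, `short_labels`, `kept`, and `replaced` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-column file with its own header line
sample_text = "gender,math score\nfemale,72\nmale,47\n"
short_labels = ["gen", "math"]

# names only: the original header line is kept as the first data row
kept = pd.read_csv(io.StringIO(sample_text), names=short_labels)

# names with header=0: the original header line is dropped and replaced
replaced = pd.read_csv(io.StringIO(sample_text), names=short_labels, header=0)

print(kept.shape)                  # (3, 2)
print(replaced.shape)              # (2, 2)
print(replaced["math"].tolist())   # [72, 47]
```

With `names` alone, the leftover header text also forces the `math` column to be read as strings, so `header=0` is usually the safer choice when the file has labels of its own.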
