# `data analysis process`

    Data analysis is the process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful
    information, informing conclusions and supporting decision-making.

    The process has 5 steps:
    1. Asking questions
    2. Data wrangling
    3. EDA (exploratory data analysis)
    4. Drawing conclusions
    5. Communicating results / data storytelling

### `data gathering`
    Data can be gathered from various sources: we can import it from files, collect it from an API, or scrape it from
    the web.
    
    When importing, the data may come from a database, a CSV file, an Excel file, a text file, or a JSON file.
    
    Once loaded, the data can be exported to CSV, JSON, Excel, a database, or HTML.
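
    As a small illustration of the export side, the sketch below loads a tiny CSV from memory and writes it back out
    in a few formats. The data and the variable names are made up for the example, and strings are used instead of
    files on disk so that it runs anywhere.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for an imported file.
raw = io.StringIO("name,age\nabhishek,23\ndaniyaal,46\n")
people = pd.read_csv(raw)

# Calling these methods without a path returns the result as a string.
as_csv = people.to_csv(index=False)
as_json = people.to_json(orient="records")
as_html = people.to_html(index=False)

print(as_csv)
print(as_json)
```

    Passing a file path as the first argument (for example to to_csv) writes the output to disk instead.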

### `1. asking questions`

    In this step we must ask ourselves the right set of questions. Some of the questions could be:
    - What features will contribute to my analysis?
    - What features are not important for my analysis?
    - Which of the features have a strong correlation?
    - Do I need data preprocessing?
    - What kind of feature manipulation/engineering is required?
    
    
    But how can I ask better questions?
    
    There are 2 ways:
    - Subject matter expertise
    - Experience



### `2. data wrangling`

    Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw"
    form into another format with the intent of making it more appropriate and valuable for a variety of downstream
    purposes such as analytics.
    
    This further involves three processes:
    - gathering data: We may gather data from CSV files, APIs, web scraping and databases.
    - assessing data:
        - finding the number of rows and columns (shape)
        - data types of the various columns (info())
        - checking for missing values (info())
        - checking for duplicate values (duplicated(), or is_unique on a single column)
        - memory occupied by the dataset (info())
        - high-level mathematical overview of the dataset (describe())
    - cleaning data:
        - handle missing data (fill them or remove them)
        - remove duplicated data (drop_duplicates())
        - handle incorrect data types (astype())
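
    The assessing and cleaning steps listed above can be sketched on a toy dataframe. The data below is made up, and
    filling the missing age with the median is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value, one duplicate row,
# and a numeric column stored as strings.
df = pd.DataFrame({
    "name": ["abhishek", "daniyaal", "daniyaal", "sameer"],
    "age": [23, 46, 46, np.nan],
    "marks": ["98", "100", "100", "90"],
})

# Assessing
print(df.shape)               # (rows, columns)
df.info()                     # dtypes, non-null counts, memory usage
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows
print(df.describe())          # summary statistics for numeric columns

# Cleaning
clean = (
    df.drop_duplicates()
      .assign(age=lambda d: d["age"].fillna(d["age"].median()))
      .astype({"marks": "int64"})
      .reset_index(drop=True)
)
print(clean)
```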
                
### `3. exploratory data analysis`
    - explore
        - finding correlation and covariance
        - doing univariate and multivariate analysis
        - plotting graphs (data visualization)
    - data augmentation: This means transforming the data according to our needs.
        - removing outliers
        - adding new features
        - merging dataframes
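
    A rough sketch of these steps on made-up data is given below. The 1.5 * IQR rule used for outliers is one common
    convention that we assume here; the column names and the bmi feature are purely illustrative.

```python
import pandas as pd

# Hypothetical data: 200 is an obvious outlier in "weight".
people = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "height": [1.60, 1.70, 1.75, 1.80, 1.85],
    "weight": [55, 65, 70, 78, 200],
})
colleges = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "college": ["DTU", "DTU", "MIT", "MIT"],
})

# Explore: correlation is unit-free, covariance depends on the units.
print(people[["height", "weight"]].corr())
print(people[["height", "weight"]].cov())

# Remove outliers with the 1.5 * IQR rule.
q1, q3 = people["weight"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = people["weight"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
inliers = people[in_range].copy()

# Add a new feature and merge in another dataframe.
inliers["bmi"] = inliers["weight"] / inliers["height"] ** 2
merged = inliers.merge(colleges, on="name", how="left")
print(merged)
```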
            
### `4. drawing conclusions`
    
    This involves making predictions using machine learning, inferential statistics and descriptive statistics.
    
    Some conclusions are so obvious that we can draw them from descriptive statistics alone.
    
    - Is Rohit Sharma a better batsman in the 2nd innings (IPL dataset)?
    - Does being female increase your chances of survival (Titanic dataset)?
    - Is Delhi the most expensive place for eating out (Zomato dataset)?
    
### `5. communicating results / data storytelling`
    - in reports
    - person to person
    - blog post
    - ppts/slide decks
    
    
    The fun part is that these steps are NOT strictly sequential, as shown in the diagram given below.

![dap](dap.png)

# importing csv files

    We can import CSV files and convert them into a dataframe. This is done with the help of the read_csv() function.
    It accepts many parameters; the most commonly used ones are covered below.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

    We can simply import a csv file using read_csv() method. This method takes in a path of a file.

In [2]:
df = pd.read_csv('aug_train.csv')
df.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


    We can also import tab-separated files with the same read_csv() function, but we need to pass in a separator.
    Without one, every line is read into a single column:

In [3]:
df = pd.read_csv('movie_titles_metadata.tsv')
df

Unnamed: 0,m0\t10 things i hate about you\t1999\t6.90\t62847\t['comedy' 'romance']
0,m1\t1492: conquest of paradise\t1992\t6.20\t10...
1,m2\t15 minutes\t2001\t6.10\t25854\t['action' '...
2,m3\t2001: a space odyssey\t1968\t8.40\t163227\...
3,m4\t48 hrs.\t1982\t6.90\t22289\t['action' 'com...
4,m5\tthe fifth element\t1997\t7.50\t133756\t['a...
...,...
611,m612\twatchmen\t2009\t7.80\t135229\t['action' ...
612,m613\txxx\t2002\t5.60\t53505\t['action' 'adven...
613,m614\tx-men\t2000\t7.40\t122149\t['action' 'sc...
614,m615\tyoung frankenstein\t1974\t8.00\t57618\t[...


    This is a problem: pandas splits on commas by default, so the tab-separated fields are not split into columns.
    We can resolve this by specifying the sep parameter.

In [4]:
df = pd.read_csv('movie_titles_metadata.tsv', sep = '\t')
df

Unnamed: 0,m0,10 things i hate about you,1999,6.90,62847,['comedy' 'romance']
0,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
1,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
2,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
3,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
4,m5,the fifth element,1997,7.5,133756.0,['action' 'adventure' 'romance' 'sci-fi' 'thri...
...,...,...,...,...,...,...
611,m612,watchmen,2009,7.8,135229.0,['action' 'crime' 'fantasy' 'mystery' 'sci-fi'...
612,m613,xxx,2002,5.6,53505.0,['action' 'adventure' 'crime']
613,m614,x-men,2000,7.4,122149.0,['action' 'sci-fi']
614,m615,young frankenstein,1974,8.0,57618.0,['comedy' 'sci-fi']


    This is how we import tab-separated files. We can also import files that use any other separator character.

In [5]:
%%writefile mydata1.csv
Name:age
Abhishek:23
Daniyaal:46
Sameer:69

Overwriting mydata1.csv


In [6]:
df = pd.read_csv('mydata1.csv')
df

Unnamed: 0,Name:age
0,Abhishek:23
1,Daniyaal:46
2,Sameer:69


In [7]:
df = pd.read_csv('mydata1.csv', sep = ':')
df

Unnamed: 0,Name,age
0,Abhishek,23
1,Daniyaal,46
2,Sameer,69


## `header & names parameter`

    The header parameter decides which row is used for the column names. By default the first row is taken as the
    column names. If we do not want the first row to be used as column names, we pass header = None.
    
    If you pass the names parameter, the names you provide are used as the column names, and this is equivalent to
    header = None.

In [8]:
%%writefile mydata2.csv
name,age,marks
abhishek,23,98
daniyaal,46, 100
priyanka,23,90

Overwriting mydata2.csv


In [9]:
pd.read_csv('mydata2.csv') # the first row has been inferred as the column names

Unnamed: 0,name,age,marks
0,abhishek,23,98
1,daniyaal,46,100
2,priyanka,23,90


    But if we explicitly specify header = None, the first row is not picked as column names and pandas uses its
    default integer column labels.

In [10]:
pd.read_csv('mydata2.csv', header = None)

Unnamed: 0,0,1,2
0,name,age,marks
1,abhishek,23,98
2,daniyaal,46,100
3,priyanka,23,90


    We may also provide the column names ourselves. This implies header = None, so the passed names are used and the
    original first row stays in the data.

In [11]:
pd.read_csv('mydata2.csv', names = ['col_1', 'col_2', 'col_3'])

Unnamed: 0,col_1,col_2,col_3
0,name,age,marks
1,abhishek,23,98
2,daniyaal,46,100
3,priyanka,23,90


    When passing names, it does not matter whether we also specify header = None, since that is already implied.

In [12]:
pd.read_csv('mydata2.csv', names = ['col_1', 'col_2', 'col_3'], header = None)

Unnamed: 0,col_1,col_2,col_3
0,name,age,marks
1,abhishek,23,98
2,daniyaal,46,100
3,priyanka,23,90
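
    In the outputs above, the original header line of the file ends up as the first data row. If the intent is to
    replace the existing header with our own names, one option is to combine names with header = 0. The sketch below
    uses an in-memory copy of the same data as mydata2.csv.

```python
import io
import pandas as pd

csv_text = "name,age,marks\nabhishek,23,98\ndaniyaal,46,100\npriyanka,23,90\n"

# header=0 marks the first line as a header row; names then replaces it.
renamed = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    names=["col_1", "col_2", "col_3"],
)
print(renamed)
```

    Because the old header row is discarded rather than kept as data, the numeric columns keep their integer types.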


## `index_col`
    
    The index_col parameter specifies which column (or columns) to use as the index of the dataframe. By default
    pandas uses a plain integer index starting at 0.

In [13]:
pd.read_csv('mydata2.csv', index_col = 0)

name,age,marks
abhishek,23,98
daniyaal,46,100
priyanka,23,90


    If we pass multiple columns to index_col, we get a dataframe with a MultiIndex.

In [14]:
pd.read_csv('mydata2.csv', index_col = [0, 1])

name,age,marks
abhishek,23,98
daniyaal,46,100
priyanka,23,90
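
    index_col also accepts a column label instead of a position, which can be easier to read. A minimal sketch,
    again using an in-memory copy of the mydata2.csv contents:

```python
import io
import pandas as pd

csv_text = "name,age,marks\nabhishek,23,98\ndaniyaal,46,100\npriyanka,23,90\n"

# Use the "name" column as the index by referring to its label.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="name")

# Rows can now be looked up by label.
print(by_name.loc["daniyaal", "marks"])
```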


## `chunking`

    Sometimes the dataset to be imported is very large. In that case we can split it into multiple parts so that we
    can work with them efficiently.

    Let us see the shape of the aug_train dataset. It is fairly large, so we will use it to demonstrate chunking.

In [15]:
pd.read_csv('aug_train.csv').shape

(19158, 14)

    Let us do chunking on this dataset. We can specify the chunksize parameter while importing the CSV file. read_csv()
    then returns an iterator that yields the data as multiple smaller dataframes of at most chunksize rows each. We can
    iterate over it and store the individual dataframes in separate variables.

In [16]:
dfs = pd.read_csv('aug_train.csv', chunksize = 5000)
dfs

<pandas.io.parsers.readers.TextFileReader at 0x204ddf3bf50>

In [17]:
chunks = list(dfs)

In [18]:
len(chunks)

4

In [19]:
for chunk in chunks:
    print(chunk.shape)

(5000, 14)
(5000, 14)
(5000, 14)
(4158, 14)


In [20]:
chunks[0].head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [21]:
chunks[1].head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
5000,18582,city_83,0.923,Male,Has relevent experience,no_enrollment,Masters,STEM,9,10000+,Pvt Ltd,4,170,0.0
5001,25222,city_159,0.843,,Has relevent experience,no_enrollment,Graduate,STEM,7,<10,Pvt Ltd,1,107,0.0
5002,5697,city_61,0.913,Female,Has relevent experience,no_enrollment,Graduate,STEM,>20,100-500,,1,9,0.0
5003,5172,city_21,0.624,Male,No relevent experience,Full time course,Graduate,STEM,3,,,never,38,1.0
5004,1815,city_27,0.848,Male,Has relevent experience,no_enrollment,Masters,STEM,18,10000+,Pvt Ltd,>4,17,1.0


In [22]:
chunks[2].head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
10000,23427,city_24,0.698,,No relevent experience,,High School,,8,,,never,63,0.0
10001,17605,city_50,0.896,,No relevent experience,no_enrollment,Graduate,STEM,>20,,,never,10,0.0
10002,20912,city_73,0.754,,No relevent experience,Full time course,Graduate,STEM,5,,,never,46,0.0
10003,13948,city_114,0.926,Male,No relevent experience,no_enrollment,Phd,STEM,>20,10000+,Pvt Ltd,1,18,0.0
10004,15205,city_160,0.92,Male,No relevent experience,Full time course,High School,,7,100-500,Pvt Ltd,1,55,0.0


In [23]:
chunks[3].head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
15000,6178,city_103,0.92,,Has relevent experience,no_enrollment,Graduate,Arts,2,50-99,Funded Startup,1,51,0.0
15001,27557,city_75,0.939,Male,Has relevent experience,no_enrollment,Graduate,STEM,6,,Public Sector,1,12,1.0
15002,27751,city_160,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,9,50-99,Pvt Ltd,1,9,0.0
15003,11186,city_103,0.92,Male,No relevent experience,no_enrollment,Graduate,STEM,15,,,>4,156,0.0
15004,1600,city_150,0.698,Male,Has relevent experience,no_enrollment,Graduate,STEM,10,<10,Pvt Ltd,1,14,0.0
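
    Calling list() on the reader, as above, loads every chunk into memory at once. When the goal is to save memory,
    a common alternative is to process each chunk inside a loop and keep only a running result. The sketch below
    assumes a made-up two-column dataset held in memory in place of a large file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 10 rows, read 4 at a time.
csv_text = "id,hours\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

total_hours = 0
row_count = 0
with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
    for chunk in reader:
        # Only one chunk is held in memory at a time.
        total_hours += chunk["hours"].sum()
        row_count += len(chunk)

print(row_count, total_hours / row_count)
```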


## `usecols & squeeze`

    Sometimes the data to be imported has a large number of columns and we only want to work with some specific columns.
    This is done by using the usecols parameter.

In [24]:
%%writefile mydata3.csv
name,age,height,weight,income,college
abhishek,23,6,96,62000,DTU
daniyaal,23,6,76,100000,DTU
sameer,45,5.11,67,100000,MIT
aryan,19,5.8,78,30000,MIT

Overwriting mydata3.csv


In [25]:
pd.read_csv('mydata3.csv')

Unnamed: 0,name,age,height,weight,income,college
0,abhishek,23,6.0,96,62000,DTU
1,daniyaal,23,6.0,76,100000,DTU
2,sameer,45,5.11,67,100000,MIT
3,aryan,19,5.8,78,30000,MIT


In [26]:
pd.read_csv('mydata3.csv', usecols = [0, 1])

Unnamed: 0,name,age
0,abhishek,23
1,daniyaal,23
2,sameer,45
3,aryan,19


    We can also use the squeeze() method when we select only 1 column. In that case we probably want a Series object
    rather than a dataframe.

In [27]:
pd.read_csv('mydata3.csv', usecols = [0]).squeeze('columns')

0    abhishek
1    daniyaal
2      sameer
3       aryan
Name: name, dtype: object
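
    Besides positions, usecols also accepts column labels or a function that decides, per column name, whether to
    keep the column. A short sketch with an in-memory excerpt of the mydata3.csv contents:

```python
import io
import pandas as pd

csv_text = (
    "name,age,height,weight,income,college\n"
    "abhishek,23,6,96,62000,DTU\n"
    "sameer,45,5.11,67,100000,MIT\n"
)

# Select columns by label rather than by position.
by_label = pd.read_csv(io.StringIO(csv_text), usecols=["name", "college"])

# A callable receives each column name and returns True to keep it.
no_body_stats = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda col: col not in {"height", "weight"},
)
print(by_label.columns.tolist())
print(no_body_stats.columns.tolist())
```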

## `skiprows & nrows parameter`

    We can use the skiprows parameter to skip certain rows and the nrows parameter to tell pandas how many rows
    to read.

In [28]:
pd.read_csv('mydata3.csv')

Unnamed: 0,name,age,height,weight,income,college
0,abhishek,23,6.0,96,62000,DTU
1,daniyaal,23,6.0,76,100000,DTU
2,sameer,45,5.11,67,100000,MIT
3,aryan,19,5.8,78,30000,MIT


In [29]:
pd.read_csv('mydata3.csv', nrows = 2) # takes only the first 2 rows of the dataframe.

Unnamed: 0,name,age,height,weight,income,college
0,abhishek,23,6,96,62000,DTU
1,daniyaal,23,6,76,100000,DTU
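
    The cells above only demonstrate nrows. A possible sketch of skiprows is given below, using made-up in-memory
    data. Line numbers are counted from 0, so line 0 is the header line.

```python
import io
import pandas as pd

csv_text = (
    "name,age,college\n"
    "abhishek,23,DTU\n"
    "daniyaal,23,DTU\n"
    "sameer,45,MIT\n"
    "aryan,19,MIT\n"
)

# A list skips exactly those line numbers (here the 1st and 3rd data rows).
skip_list = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# A callable is evaluated on each line number; True means "skip this line".
skip_even = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and i % 2 == 0,
)

# skiprows and nrows can be combined: skip two data rows, then read one.
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 3), nrows=1)

print(skip_list["name"].tolist())
print(skip_even["name"].tolist())
print(window["name"].tolist())
```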


## `skipping bad lines`

    Sometimes a dataset contains rows that have a different number of fields from all the other rows. For example,
    all the rows have 7 fields except for a few. These few rows are called bad rows (bad lines). By default,
    importing such a dataset raises a ParserError, and we usually do not want to include these rows.

In [30]:
%%writefile bad_data.csv
name,age,college
abhishek,23,DTU
sameer,23,DTU
daniyaal,23,MIT,, # This is the bad row, this will give us an error while importing the dataset.
siddharth,23,Harvard
viraj sehgal,24,oxford

Overwriting bad_data.csv


In [31]:
from pandas.errors import ParserError

In [32]:
try:
    pd.read_csv('bad_data.csv')
except ParserError as ex:
    print(ex)

Error tokenizing data. C error: Expected 3 fields in line 4, saw 6



In [33]:
pd.read_csv('bad_data.csv', on_bad_lines = 'skip')

Unnamed: 0,name,age,college
0,abhishek,23,DTU
1,sameer,23,DTU
2,siddharth,23,Harvard
3,viraj sehgal,24,oxford
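
    Skipping is not the only option. In pandas 1.4 and later, on_bad_lines can also be a function that receives the
    bad line as a list of strings and returns a repaired list (or None to drop it); this form requires
    engine = 'python'. The sketch below keeps the first three fields of a bad line. The data and the helper name
    keep_first_three are made up for the example.

```python
import io
import pandas as pd

csv_text = (
    "name,age,college\n"
    "abhishek,23,DTU\n"
    "sameer,23,DTU\n"
    "daniyaal,23,MIT,extra,field\n"
    "siddharth,23,Harvard\n"
)

def keep_first_three(fields):
    # fields is the bad line, already split into a list of strings.
    return fields[:3]

repaired = pd.read_csv(
    io.StringIO(csv_text),
    on_bad_lines=keep_first_three,
    engine="python",  # callables are only supported by the python engine
)
print(repaired)
```

    Passing on_bad_lines = 'warn' is a lighter alternative that skips the line but reports it.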


## `dtype parameter`

    The dtype parameter is used to set the data type of a column while importing.

In [34]:
pd.read_csv('aug_train.csv')

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [35]:
pd.read_csv('aug_train.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)
me

In [36]:
pd.read_csv('aug_train.csv', dtype = {'target': int})

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0
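
    Casting to plain int only works when the column has no missing values. For columns with gaps, the nullable
    'Int64' dtype is one option, and 'category' can be useful for columns with many repeated strings. A minimal
    sketch on made-up data that only borrows the column names from aug_train.csv:

```python
import io
import pandas as pd

# Hypothetical excerpt: the second row has a missing target.
csv_text = "city,target\ncity_103,1\ncity_40,\ncity_103,0\n"

typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "city": "category",  # repeated strings stored as integer codes
        "target": "Int64",   # nullable integer, tolerates missing values
    },
)
print(typed.dtypes)
print(typed)
```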


## `converting dates`

    We can also convert a column into datetime format at the time of loading.

In [37]:
%%writefile mydata4.csv
name,age,date
abhishek,23,4 June 2000
amrusha,24,22 July 1999
priyanka,24,20 June 2000

Overwriting mydata4.csv


In [38]:
df = pd.read_csv('mydata4.csv')
df

Unnamed: 0,name,age,date
0,abhishek,23,4 June 2000
1,amrusha,24,22 July 1999
2,priyanka,24,20 June 2000


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      object
 1   age     3 non-null      int64 
 2   date    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes


    We can convert date columns directly into datetime64[ns] using the parse_dates parameter.

In [40]:
df = pd.read_csv('mydata4.csv', parse_dates = ['date'])
df

Unnamed: 0,name,age,date
0,abhishek,23,2000-06-04
1,amrusha,24,1999-07-22
2,priyanka,24,2000-06-20


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   name    3 non-null      object        
 1   age     3 non-null      int64         
 2   date    3 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 204.0+ bytes


    We can also pass a list of columns that should be combined and parsed as a single date column.

In [42]:
%%writefile mydata5.csv
name,age,year,month,day
abhishek,23,2000,june,4
amrusha,24,1999,july,22
priyanka,23,1999,june,20
jacob,24,1999,july,22

Overwriting mydata5.csv


In [43]:
pd.read_csv('mydata5.csv', parse_dates = [[2,3,4]])


Unnamed: 0,year_month_day,name,age
0,2000-06-04,abhishek,23
1,1999-07-22,amrusha,24
2,1999-06-20,priyanka,23
3,1999-07-22,jacob,24
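
    Newer pandas releases emit a deprecation warning for the nested-list form of parse_dates. An alternative that
    does not rely on it is to read the columns as they are and combine them with pd.to_datetime() afterwards. The
    sketch below assumes the same column layout as mydata5.csv, with full English month names.

```python
import io
import pandas as pd

csv_text = (
    "name,age,year,month,day\n"
    "abhishek,23,2000,june,4\n"
    "amrusha,24,1999,july,22\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Build strings such as "2000 June 4" and parse them with an explicit format.
date_text = (
    df["year"].astype(str)
    + " " + df["month"].str.title()
    + " " + df["day"].astype(str)
)
df["date"] = pd.to_datetime(date_text, format="%Y %B %d")
df = df.drop(columns=["year", "month", "day"])
print(df)
```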


## `converters`
    
    The read_csv() function takes a parameter called converters, which accepts a dictionary. This dictionary has column
    names as keys and functions as values. Each function is applied to every value of the column given as its key.

In [44]:
%%writefile mydata6.csv
name,age,marks
abhishek,23,95
amrusha,24,33
priyanka,24,87
daniyaal,34,56

Overwriting mydata6.csv


In [45]:
def converter(marks):
    if int(marks) > 33:
        return "pass"
    return "fail"

In [46]:
pd.read_csv('mydata6.csv', converters = {'marks' : converter}) # converter is called for every value of marks

Unnamed: 0,name,age,marks
0,abhishek,23,pass
1,amrusha,24,fail
2,priyanka,24,pass
3,daniyaal,34,pass
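
    Converters receive the raw text of each cell, which makes them handy for cleaning values that pandas cannot parse
    as numbers on its own. The sketch below strips a currency symbol; the fee column and the parse_fee helper are
    hypothetical.

```python
import io
import pandas as pd

csv_text = "name,fee\nabhishek,$1200.50\namrusha,$300\n"

def parse_fee(text):
    # The converter gets the cell as a string: drop the symbol, then cast.
    return float(text.replace("$", ""))

fees = pd.read_csv(io.StringIO(csv_text), converters={"fee": parse_fee})
print(fees["fee"].sum())
```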


## `na_values`

    The na_values parameter in read_csv() takes a list of values that are to be treated as missing values while
    importing the dataset.

In [47]:
%%writefile mydata7.csv
name,age,marks
abhishek,23,95
amrusha,24,-
priyanka,24,87
daniyaal,34,-

Overwriting mydata7.csv


In [48]:
df = pd.read_csv('mydata7.csv', na_values = ['-'])
df

Unnamed: 0,name,age,marks
0,abhishek,23,95.0
1,amrusha,24,
2,priyanka,24,87.0
3,daniyaal,34,


    na_values also accepts a dictionary, which lets us specify the values to be treated as NaN separately for each
    column.

In [49]:
pd.read_csv('mydata7.csv', na_values = {'age' :[23], 'marks': '-'})
# The above code treats 23 as a missing value in the age column and - as a missing
# value in the marks column while importing the CSV file.

Unnamed: 0,name,age,marks
0,abhishek,,95.0
1,amrusha,24.0,
2,priyanka,24.0,87.0
3,daniyaal,34.0,


# `reading json files`

    For importing json data into a dataframe, we need to use the read_json() function.

In [50]:
%%writefile data.json
[
    {
        "name": "Jason",
        "gender": "M",
        "age": 27
    },
    {
        "name": "Rosita",
        "gender": "F",
        "age": 23
    },
    {
        "name": "Leo",
        "gender": "M",
        "age": 19
    }
]

Overwriting data.json


In [51]:
pd.read_json('data.json')

Unnamed: 0,name,gender,age
0,Jason,M,27
1,Rosita,F,23
2,Leo,M,19
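
    JSON data is also often stored with one object per line (sometimes called JSON Lines). read_json() can read this
    layout with lines = True. A minimal sketch using an in-memory string in place of a file:

```python
import io
import pandas as pd

# One JSON object per line instead of a single JSON array.
json_lines = (
    '{"name": "Jason", "gender": "M", "age": 27}\n'
    '{"name": "Rosita", "gender": "F", "age": 23}\n'
)
records = pd.read_json(io.StringIO(json_lines), lines=True)
print(records)
```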


# `reading text files`

    For reading delimited text files, we can also use the read_csv() function. It parses the text file into a
    dataframe, just as it does for a CSV file.

In [52]:
%%writefile text.txt
name,age,marks
abhishek,23,95
amrusha,24,-
priyanka,24,87
daniyaal,34,-

Overwriting text.txt


In [53]:
df = pd.read_csv('text.txt')
df

Unnamed: 0,name,age,marks
0,abhishek,23,95
1,amrusha,24,-
2,priyanka,24,87
3,daniyaal,34,-


In [54]:
%%writefile text1.txt
name:age:marks
abhishek:23:95
amrusha:24:-
priyanka:24:87
daniyaal:34:-

Overwriting text1.txt


In [55]:
df = pd.read_csv('text1.txt', sep = ':')
df

Unnamed: 0,name,age,marks
0,abhishek,23,95
1,amrusha,24,-
2,priyanka,24,87
3,daniyaal,34,-
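
    Text files are sometimes aligned with a varying number of spaces rather than a single separator character. In that
    case a regular expression can be passed as sep. The sketch below uses made-up, space-aligned data.

```python
import io
import pandas as pd

text = (
    "name      age  marks\n"
    "abhishek   23     95\n"
    "priyanka   24     87\n"
)

# r"\s+" means "one or more whitespace characters".
table = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(table)
```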


# `reading excel files`

    We can also read Excel files using the read_excel() function. By default it reads the data of the first sheet of
    the workbook. If we want to import the data of another sheet, we need to specify it with the sheet_name parameter
    of read_excel().

In [56]:
pd.read_excel('marks.xlsx') # This is the data of the first sheet by default

Unnamed: 0,name,age
0,abhishek,23
1,daniyaal,23
2,sameer,23
3,siddharth,24
4,viraj,24
5,priyanka,25
6,simran,26


    To import the data of the other sheet, we need to specify the sheet name. Here we also treat age cells that
    contain a single space as missing values.

In [57]:
pd.read_excel('marks.xlsx', na_values = {'age' : [' '] },sheet_name = 'marks_sheet')

Unnamed: 0,name,age,marks
0,abhishek,23.0,96
1,daniyaal,23.0,98
2,sameer,,99
3,siddharth,24.0,99
4,viraj,,78
5,priyanka,25.0,57
6,simran,26.0,90


# `reading from a mysql database`

In [58]:
import mysql.connector

In [59]:
try:
    conn = mysql.connector.connect(host = 'localhost', user = 'root', password = '', database = 'world')
except Exception as ex:
    print(ex.__class__.__name__)
    print(ex)

ProgrammingError
1049 (42000): Unknown database 'world'


In [60]:
try:
    df = pd.read_sql_query('SELECT * FROM country', conn)
    df
except Exception as ex:
    print(ex.__class__.__name__)
    print(ex)

NameError
name 'conn' is not defined
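
    The cells above fail because the world database is not available on this machine, so conn is never created. To
    show what read_sql_query() returns when a connection does exist, the sketch below swaps MySQL for Python's
    built-in sqlite3 module with an in-memory database. The table contents are made up.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("country_a", 5000000), ("country_b", 300000)],  # hypothetical rows
)
conn.commit()

# The query result comes back as a regular dataframe.
df = pd.read_sql_query(
    "SELECT * FROM country WHERE population > ?",
    conn,
    params=(1000000,),
)
conn.close()
print(df)
```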
