# Data Analysis: An Analogy
## Let's understand Data Analysis with an example
Imagine you own an apple farm, and you want to know the number of apples you grow. But, you are too busy with the farm so you hire someone to count them. You sell your apples too, and you get your apple counter to keep a record of the number of apples you have in the beginning and at the end of the day, every day.

Many days and months pass and you put sheet after sheet of the apple count together and you discover patterns and trends in the purchasing behaviour of your customers.

The trends and patterns help you realise that during the colder season, your output of apples is the same, but people buy fewer apples compared to the summer.

You then set out to dig deeper into this trend and find ways to keep the sales of apples consistent throughout the year, beating your competitors at the game and becoming an apple farm tycoon.

Apples are your data: tracking them is important, and analysis is key.

For starters, you will know whether your supply of apples matches the market's demand, and how consistent the ratio of demand to supply is throughout the year. Attaching a price to each apple and subtracting your costs gives you your profit.

When you have enough data, you will find trends and patterns in your production. These trends can help you understand your own organisation better, help you reduce inefficiency, and therefore reduce costs.

# What is Data Analysis?
As an apple farmer, you collected the count of apples at the beginning of the day and at the end of the day in an organized fashion in the sheets. In the end, you got some insightful information (i.e., trends and patterns in the sales of apples) from that data. This is called Data Analysis.

Putting it all together:
Data Analysis is a method of collecting, organizing, and, if required, manipulating the data so that one can derive some useful information from the data.

## Data Analysis and Pandas
Pandas is a Python library that helps you read data from a file, organize it in a tabular format, and, if required, clean and manipulate it so that you can derive insightful information from it.
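To make the apple farm analogy concrete, here is a minimal sketch in Pandas. The numbers, the column names (`season`, `start_count`, `end_count`) and the use of `groupby` are assumptions made only for illustration; none of them come from a real dataset, and the individual calls are explained in later sections.

```python
import pandas as pd

# Hypothetical daily apple counts recorded by the apple counter
apples = pd.DataFrame({
    "season": ["summer", "summer", "winter", "winter"],
    "start_count": [500, 480, 510, 495],
    "end_count": [120, 100, 330, 345],
})

# Apples sold each day = count at the start of the day - count at the end
apples["sold"] = apples["start_count"] - apples["end_count"]

# Average daily sales per season exposes the seasonal pattern
avg_sold = apples.groupby("season")["sold"].mean()
print(avg_sold)
```

In this made-up data the winter average comes out lower than the summer average, which is the kind of trend the farmer in the story discovers.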

# What is Pandas?
The name is commonly expanded as "Python Data Analysis Library"; the project itself traces it to "panel data", an econometrics term.
It is an open-source Python library.
It is a tool used by data scientists to:
- read,
- write,
- manipulate, and
- analyze the data.
## Why Pandas?
It helps you explore and manipulate data in an efficient manner.

It helps you analyze large volumes of data with ease. When we say large volumes, it can be in millions of rows/records.

## Why is Pandas so Popular?
- Easy to read and learn
- Fast and powerful for tabular data
- Integrates well with visualization libraries
### Importing Pandas
Any time you want to use a library in Python, you must first import it to make it accessible.

You can import/load Pandas in your notebook or any other Python IDE in two different ways:

In [1]:
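# Option 1: plain import (you then refer to the library as "pandas")
# import pandas
# Option 2: import with the conventional alias "pd" (used in this tutorial)
import pandas as pd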

# Series

# Pandas Objects
Before we dive into series, let’s do a quick recap of pandas ‘objects’. At the core of the pandas library, there are two fundamental data structures/objects:

- Series
- DataFrames
## What is a Series?
- A one-dimensional labeled array
- Can hold data of any type
- Is like a column in a table

## What can a Series have?
- A Series can hold only numbers.
- A Series can hold only strings.
- A Series can hold a mix of numbers and strings.

A Series is like a Python list: it can hold values of any type, such as integers, strings, and floats (decimal values). Unlike a list, every item in a Series is labeled with an index.

By default, indexing starts from 0 in Series.

### Create a Series
Remember to import the library before using it!

You can create your own Series using a Python list:

In [2]:
h = [1,1,2,3,4,5,6,76,5]
pd.Series(h)

0     1
1     1
2     2
3     3
4     4
5     5
6     6
7    76
8     5
dtype: int64

In [3]:
# You can also create your own Series using a dictionary:

d = {"one":1, "two":2, "three": 3, "four":4, "five":5, "six":6, "seven": 7}
pd.Series(d)

one      1
two      2
three    3
four     4
five     5
six      6
seven    7
dtype: int64
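The three cases described earlier (all numbers, all strings, a mix of both) can be sketched as follows. The variable names and values are made up for the example.

```python
import pandas as pd

numbers = pd.Series([10, 20, 30])     # all numbers
words = pd.Series(["a", "b", "c"])    # all strings
mixed = pd.Series([1, "two", 3.0])    # numbers and strings together

print(numbers.dtype)          # an integer dtype
print(mixed.dtype)            # object, because the element types differ
print(list(numbers.index))    # default labels start at 0
```

A Series built from mixed Python types falls back to the general-purpose `object` dtype, while a Series of whole numbers gets an integer dtype.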

# DataFrame
### What is a DataFrame?
- Two-dimensional table
- Made up of a collection of Series
- Structured with labeled axes (rows and columns)
## Create a DataFrame
You can create a DataFrame using a Python list or a NumPy array:

In [4]:
data = [[1000, "Leorio", 86.58],
       [1001, "Kilua", 86.58],
       [1002, "Zenitsu", 86.58],
       [1003, "Tomioka", 86.58],
       [1004, "Inosuke", 86.58]]
pd.DataFrame(data)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1000 | Leorio | 86.58 |
| 1 | 1001 | Kilua | 86.58 |
| 2 | 1002 | Zenitsu | 86.58 |
| 3 | 1003 | Tomioka | 86.58 |
| 4 | 1004 | Inosuke | 86.58 |


In [5]:
# Don't want the default labels that start at 0? You can supply your own column names and row index:

data = [[1000, "Leorio", 86.58],
       [1001, "Kilua", 86.58],
       [1002, "Zenitsu", 86.58],
       [1003, "Tomioka", 86.58],
       [1004, "Inosuke", 86.58]]
pd.DataFrame(data, columns = ["Regd. No", "Name", "Marks"], index = [1,2,3,4,5])


|   | Regd. No | Name | Marks |
|---|---|---|---|
| 1 | 1000 | Leorio | 86.58 |
| 2 | 1001 | Kilua | 86.58 |
| 3 | 1002 | Zenitsu | 86.58 |
| 4 | 1003 | Tomioka | 86.58 |
| 5 | 1004 | Inosuke | 86.58 |
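The examples above build the DataFrame from a Python list. A NumPy array works the same way; the following is a minimal sketch with made-up values, column names and row labels.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: each inner list becomes one row
scores = np.array([[86, 91], [72, 69], [88, 95]])

# Column names and row labels are invented for this example
df = pd.DataFrame(scores, columns=["Test 1", "Test 2"], index=["a", "b", "c"])
print(df)
print(df.shape)
```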


You can also create a DataFrame using a dictionary:

In [6]:
data = {"Regd.No": [1000,1001,1002,1003,1004],
       "Names": ["Leorio","Killua","Zenitsu","Tomioka", "Inosuke"],
       "Marks%": [86.29,91.63, 72.90,69.23,88.30]}
pd.DataFrame(data)

|   | Regd.No | Names | Marks% |
|---|---|---|---|
| 0 | 1000 | Leorio | 86.29 |
| 1 | 1001 | Killua | 91.63 |
| 2 | 1002 | Zenitsu | 72.9 |
| 3 | 1003 | Tomioka | 69.23 |
| 4 | 1004 | Inosuke | 88.3 |


## A Column is a Series
A DataFrame is a collection of series.

A series is a column in a table or a DataFrame.

There are 3 series in the given DataFrame - 'Regd.No', 'Names' and 'Marks%'.
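One way to check that a column really is a Series is to select it and look at its type. This sketch rebuilds a shortened version of the DataFrame above; the square-bracket selection it uses is covered in more detail later.

```python
import pandas as pd

data = {"Regd.No": [1000, 1001, 1002],
        "Names": ["Leorio", "Killua", "Zenitsu"],
        "Marks%": [86.29, 91.63, 72.90]}
df = pd.DataFrame(data)

# Selecting one column with square brackets returns a Series
names = df["Names"]
print(type(names))
print(len(df.columns))   # number of columns (Series) in the DataFrame
```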

# Read and Write Files
## Reading Data Files
It is always good to be able to create a DataFrame by hand. But, generally, we don’t create our own data by hand. We work on the data that already exists.

Data exists in a number of formats. The most basic of these is the CSV file. CSV stands for comma-separated values.
## What is a CSV file?
- CSV files are normally created by programs that handle large amounts of data. They are a convenient way to export data from spreadsheets and databases and to import or use it in other programs.
- CSV is a simple file format used to store tabular data, such as a spreadsheet or database table.
- A CSV file stores tabular data (numbers and text) in plain text.
- Each line of the file is a data record (row).
- Each record consists of one or more fields, separated by commas.
- The use of the comma as a field separator is the source of the name for this file format.

# Working with CSV files in Python
- Python has a built-in module named csv for working with CSV files.
- However, a common method for working with CSV files is using Pandas. It makes importing and analyzing data much easier.
- One crucial feature of Pandas is its ability to write and read Excel, CSV, and many other common types of files.
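To see why Pandas is usually more convenient than the built-in csv module, the sketch below reads the same small CSV text both ways. The CSV content is invented, and `io.StringIO` is used only so the example runs without a file on disk.

```python
import csv
import io

import pandas as pd

# A small CSV document held in memory; the contents are invented
csv_text = "name,marks\nLeorio,86.29\nKillua,91.63\n"

# Built-in csv module: every row comes back as a list of strings
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)

# Pandas: the header becomes column names and numbers are parsed as numbers
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
```

With the csv module the header and the numeric values are plain strings that you have to convert yourself; `read_csv()` handles both in one call.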
## Pandas read_csv
- Functions such as read_csv() let you load a file with a single call.
- The read_csv() function reads a CSV file into a DataFrame object.
- A CSV file is similar to a two-dimensional table, and a DataFrame represents exactly that kind of two-dimensional tabular view.
- The most basic way to read a csv file in Pandas:

In [7]:
# reading CSV file
#pd.read_csv("filename.csv")

- This one function can do much more than a plain read.
- For instance, read_csv() can read a CSV file from a URL as well as from a local path, and it lets you choose which columns to import so that you don't have to edit the DataFrame later.
- These variations are controlled by the arguments you pass to it.
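As a hedged illustration of such arguments, the sketch below uses `usecols` to import only some columns and `index_col` to use one column as the row labels. The CSV text is invented and held in memory with `io.StringIO`; a file path or URL string could be passed in its place.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file
csv_text = "id,name,marks\n1000,Leorio,86.29\n1001,Killua,91.63\n"

# usecols keeps only the listed columns; index_col uses a column as row labels
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "marks"], index_col="id")
print(df)
```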

# Pandas to_csv with example
The easiest way to write DataFrames to CSV files is using the Pandas to_csv function.
- Syntax:

In [8]:
# DataFrame to CSV file
# df is the name of the DataFrame here
# df.to_csv("file_name.csv")

- If you want to export without the index, simply add index=False:


In [9]:
# Specify index as False to export without the index
# df.to_csv("file_name.csv", index = False)
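The effect of `index=False` is easiest to see side by side. In this sketch the DataFrame is made up, and `to_csv()` is called without a file name so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Leorio", "Killua"], "marks": [86.29, 91.63]})

# With no file path, to_csv() returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)      # first column holds the row index
print(without_index)   # only the data columns are written
```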

# Basic Methods and Attributes - On Real World Dataset
## Dataset: Exam Scores
This dataset contains marks secured by different students in an examination and their background information.

## Read The dataset
So far you have learned how to read a CSV file using the read_csv() function. Let's see the practical implementation.
The read_csv() function belongs to the Pandas library, so the first task is always to import the library we will use.


In [10]:
import pandas as pd

In [11]:
# to read a csv file, use read_csv() function of pandas library.
exam_scores = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/exam_scores.csv")

## Methods and Attributes of DataFrame
## shape attribute
- shape: It tells you the shape of your DataFrame, i.e., (number of rows, number of columns).
- After loading the data, we can check the shape of the DataFrame using this attribute:

In [12]:
exam_scores.shape

(1000, 8)

- So our DataFrame has 1000 rows and 8 columns. From here, we can also say that our DataFrame has 8000 entries.
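The relationship between rows, columns and total entries can be checked on any DataFrame. Here is a small sketch on a made-up 3 x 2 DataFrame; `size` and `len()` are standard Pandas/Python features that are not used elsewhere in this tutorial.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(df.size)              # total number of cells = rows * columns
print(len(df))              # len() gives the number of rows
```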

## head( ) method
- head( ): It shows the first five observations (rows) of your DataFrame by default.
- head() gives us a quick look at the contents of the DataFrame, such as the column headers and the kind of values each column holds.

In [13]:
exam_scores.head()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | male | group B | bachelor's degree | standard | none | 74 | 68 | 67 |
| 1 | female | group C | some college | standard | completed | 58 | 68 | 66 |
| 2 | male | group C | some college | free/reduced | none | 66 | 65 | 65 |
| 3 | female | group D | bachelor's degree | free/reduced | none | 74 | 75 | 73 |
| 4 | male | group D | some college | standard | none | 78 | 77 | 71 |


- We can see the first five observations of the dataset in the above table.
## tail( ) method
- tail( ): This method is similar to head(), but instead of the first five observations it returns the last five observations of your dataset.

In [14]:
exam_scores.tail()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 995 | female | group C | some high school | standard | none | 68 | 77 | 72 |
| 996 | female | group E | some college | standard | none | 98 | 81 | 94 |
| 997 | female | group E | associate's degree | free/reduced | none | 67 | 67 | 67 |
| 998 | female | group C | high school | standard | none | 63 | 68 | 70 |
| 999 | male | group C | some college | free/reduced | none | 49 | 57 | 50 |


- We can see the last five observations of the dataset in the above table.


## head( ) and tail( )
We can also pass the number of rows to display to both head( ) and tail( ). See the examples below:

In [15]:
exam_scores.head(2)

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | male | group B | bachelor's degree | standard | none | 74 | 68 | 67 |
| 1 | female | group C | some college | standard | completed | 58 | 68 | 66 |


In [16]:
exam_scores.tail(2)

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 998 | female | group C | high school | standard | none | 63 | 68 | 70 |
| 999 | male | group C | some college | free/reduced | none | 49 | 57 | 50 |


## dtypes
- dtypes: This attribute shows the data type of each column.


In [17]:
exam_scores.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

We can observe from this output:

- 'gender', 'race/ethnicity', 'parental level of education', 'lunch' and 'test preparation course' are of data type - object.
- 'math score', 'reading score' and 'writing score' are of data type - int64 (i.e. integer).

## info( )
info( ): This method will return a concise summary about the DataFrame.

In [18]:
exam_scores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


# Use Case
## Real-World Use Case
One of the very first things you should learn is to select the relevant data quickly and effectively from a DataFrame. This is a very basic step that is needed in almost any data operation that you will run.

- For example:

Say you are a data scientist at Amazon. For the New Year, your supervisor asks you to apply a 50% discount to all electronic items available on Amazon.

You would not want to go through the electronic items one by one and assign a 50% discount to each of them. That would be very time-consuming. Instead, you would select all the items that fall under the category 'Electronics' and give all of them a 50% discount in one go. So easy, right? This is where you need to index, select and assign data in a DataFrame.
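The discount scenario can be sketched in a few lines. The catalogue, the column names (`item`, `category`, `discount_pct`) and the use of a boolean condition inside `loc` are all assumptions made for this illustration; they are not Amazon's data or method.

```python
import pandas as pd

# Hypothetical product catalogue
items = pd.DataFrame({
    "item": ["Phone", "Novel", "Laptop", "Sofa"],
    "category": ["Electronics", "Books", "Electronics", "Furniture"],
    "discount_pct": [0, 0, 0, 0],
})

# True for every row whose category is 'Electronics'
is_electronics = items["category"] == "Electronics"

# Select those rows and assign the discount to all of them in one step
items.loc[is_electronics, "discount_pct"] = 50
print(items)
```

Only the rows marked as electronics are changed; the other rows keep a discount of 0.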

# Indexing
## What is Indexing in Pandas?
- Indexing in Pandas means selecting particular rows and columns from a DataFrame.
- Indexing in Pandas is similar to indexing a Python list or a NumPy array.
- There are two different methods of indexing in Pandas:
  - loc - label-based selection
  - iloc - index-based (integer position) selection

# Index-Based Selection
- Index-based selection selects data based on its numerical position in the DataFrame.
- iloc is used for selecting data based on numerical position.
- The syntax for using iloc is:


In [19]:
# df.iloc[ ]

where df is a DataFrame name. You can pass the numerical positions of rows and columns to select in the square bracket.
- Do you remember the indexing in a NumPy array? If not, don’t worry; you will soon see its implementation on a dataset.
- Import Pandas Library and load the ‘exam_scores.csv’ file

In [20]:
# select first row and column
exam_scores.iloc[0,0]

'male'

In [21]:
# select the first five rows of the 5th column - remember that Python uses 0-based indexing,
# so the 5th column is at index 4

In [22]:
exam_scores.iloc[0:5,4]

0         none
1    completed
2         none
3         none
4         none
Name: test preparation course, dtype: object

In [24]:
# leaving out the start and end around ':' selects everything along that axis
exam_scores.iloc[:,:]

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | male | group B | bachelor's degree | standard | none | 74 | 68 | 67 |
| 1 | female | group C | some college | standard | completed | 58 | 68 | 66 |
| 2 | male | group C | some college | free/reduced | none | 66 | 65 | 65 |
| 3 | female | group D | bachelor's degree | free/reduced | none | 74 | 75 | 73 |
| 4 | male | group D | some college | standard | none | 78 | 77 | 71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group C | some high school | standard | none | 68 | 77 | 72 |
| 996 | female | group E | some college | standard | none | 98 | 81 | 94 |
| 997 | female | group E | associate's degree | free/reduced | none | 67 | 67 | 67 |
| 998 | female | group C | high school | standard | none | 63 | 68 | 70 |
| 999 | male | group C | some college | free/reduced | none | 49 | 57 | 50 |


In [25]:
# You can also pass list of indexes
exam_scores.iloc[[0,1,2,3],2]

0    bachelor's degree
1         some college
2         some college
3    bachelor's degree
Name: parental level of education, dtype: object

In [26]:
# You can also pass negative indexes
exam_scores.iloc[-5:, 0:2]

|   | gender | race/ethnicity |
|---|---|---|
| 995 | female | group C |
| 996 | female | group E |
| 997 | female | group E |
| 998 | female | group C |
| 999 | male | group C |


# Label-Based Selection
- Label-based selection selects data based on the row and column labels (names) rather than their positions.
- Label-based selection is made with loc.
- Do you remember the default index of the DataFrame and the custom row and column labels that you set in the previous topics?
- loc and iloc are conceptually similar. The difference is that iloc always uses integer positions (0, 1, 2, ...), no matter what the labels are, while loc uses the labels themselves. With the default index the two often coincide, because the labels are also 0, 1, 2, ...
- Another difference: a slice such as 0:5 excludes the end position with iloc but includes the end label with loc.
- The syntax for loc is similar to iloc:


In [27]:
# df.loc[]  , where df is the DataFrame name.

## Selecting data using loc:

In [28]:
# to get first entry in race/ethnicity
exam_scores.loc[0, "race/ethnicity"]

'group B'

In [29]:
# select rows labeled 0 through 5 (loc includes the end label, so six rows) for the columns gender, lunch and math score
exam_scores.loc[0:5, ["gender", "lunch", "math score"]]

|   | gender | lunch | math score |
|---|---|---|---|
| 0 | male | standard | 74 |
| 1 | female | standard | 58 |
| 2 | male | free/reduced | 66 |
| 3 | female | free/reduced | 74 |
| 4 | male | standard | 78 |
| 5 | female | standard | 75 |
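The difference between loc and iloc is easiest to see when the row labels do not match the positions. The following sketch uses a made-up DataFrame whose labels deliberately start at 1.

```python
import pandas as pd

# Row labels start at 1, so labels and positions differ
df = pd.DataFrame({"name": ["Leorio", "Killua", "Zenitsu"]}, index=[1, 2, 3])

print(df.iloc[0, 0])        # position 0 -> first row
print(df.loc[1, "name"])    # label 1 -> also the first row

# Slices: iloc excludes the end position, loc includes the end label
print(len(df.iloc[0:2]))    # rows at positions 0 and 1
print(len(df.loc[1:2]))     # rows labeled 1 and 2
print(len(df.loc[1:3]))     # rows labeled 1, 2 and 3
```

This also explains why a loc slice of 0:5 on the default index returns six rows while an iloc slice of 0:5 returns five.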


# Selecting
## Attribute (Dot) Based Selection

In [33]:
# selecting a column
exam_scores.gender


0        male
1      female
2        male
3      female
4        male
        ...  
995    female
996    female
997    female
998    female
999      male
Name: gender, Length: 1000, dtype: object

## Dictionary (Bracket) Based Selection

In [34]:
exam_scores["gender"]

0        male
1      female
2        male
3      female
4        male
        ...  
995    female
996    female
997    female
998    female
999      male
Name: gender, Length: 1000, dtype: object

## Selecting Multiple Columns
- While selecting multiple columns, we use double square brackets [[]].

In [35]:
exam_scores[["lunch","gender"]]

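|   | lunch | gender |
|---|---|---|
| 0 | standard | male |
| 1 | standard | female |
| 2 | free/reduced | male |
| 3 | free/reduced | female |
| 4 | standard | male |
| ... | ... | ... |
| 995 | standard | female |
| 996 | standard | female |
| 997 | free/reduced | female |
| 998 | standard | female |
| 999 | free/reduced | male |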


## Assigning
- Assigning data to a DataFrame takes one line of code. The commented-out line below would set every value in the lunch column to "Standard":

In [36]:
# exam_scores.lunch = "Standard"
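Here is a minimal sketch of assignment on a small made-up DataFrame, so the real dataset stays untouched. It shows two common cases: assigning one value to a whole column, and assigning a list with one value per row.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Leorio", "Killua", "Zenitsu"],
                   "lunch": ["standard", "free/reduced", "standard"]})

# A single value is copied into every row of the column
df["lunch"] = "standard"

# A list with one value per row assigns row by row; a new name adds a column
df["marks"] = [86.29, 91.63, 72.90]
print(df)
```

Bracket style is used here because it works both for existing columns and for creating new ones.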


# Summary Functions
## Learned Till Now
So far, you have learned to read a data file, some methods to check the overview of data, and select data from a DataFrame.

The information we have gathered about the data so far is not enough. A data scientist always wants to understand the behavior of the data. For numerical columns, we may want to know the mean or median value, the minimum value and the maximum value. For categorical columns, we want to know things like the number of different categories in a column, the count of each category, and the most frequent category. Here we will look at some techniques that provide this information.

## What is Summary?
- Summary is a term used for the short version of a longer work.
- Summary is a brief statement of the main points of something.
- Confusing? Let’s understand through an example.
## Summary Functions
- Pandas has many simple "summary functions" (this is not an official name) that condense your data and display useful information about it.
- Do you recall the info() method that we used earlier? It gave us the main points about the data: the number of observations and the index range, the column names with their data types, the number of non-null entries in each column, and so on.
- So info() is also a summary function/method.
- Here we will see another summary function - describe().
- Import Pandas Library and load the dataset ‘exam_scores.csv’:
## describe( )
By default, the describe() method returns a summary of numerical columns only.

In [37]:
exam_scores.describe()

|   | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 67.128 | 70.174 | 68.973 |
| std | 14.815367 | 14.85599 | 15.109155 |
| min | 15.0 | 18.0 | 10.0 |
| 25% | 58.0 | 60.0 | 59.0 |
| 50% | 67.0 | 70.0 | 69.0 |
| 75% | 78.0 | 81.0 | 80.0 |
| max | 100.0 | 100.0 | 100.0 |


If we want to get a summary of categorical columns separately, then we can use the parameter 'include'.

In [38]:
exam_scores.describe(include = "object")

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course |
|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 |
| unique | 2 | 5 | 6 | 2 | 2 |
| top | female | group C | some college | standard | none |
| freq | 502 | 294 | 226 | 649 | 654 |


The returned summary is all about the categorical columns.
Also, we can get a summary of numerical and categorical columns together using the same parameter 'include'.


In [39]:
exam_scores.describe(include = "all")

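|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000.0 | 1000.0 | 1000.0 |
| unique | 2 | 5 | 6 | 2 | 2 | NaN | NaN | NaN |
| top | female | group C | some college | standard | none | NaN | NaN | NaN |
| freq | 502 | 294 | 226 | 649 | 654 | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | 67.128 | 70.174 | 68.973 |
| std | NaN | NaN | NaN | NaN | NaN | 14.815367 | 14.85599 | 15.109155 |
| min | NaN | NaN | NaN | NaN | NaN | 15.0 | 18.0 | 10.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | 58.0 | 60.0 | 59.0 |
| 50% | NaN | NaN | NaN | NaN | NaN | 67.0 | 70.0 | 69.0 |
| 75% | NaN | NaN | NaN | NaN | NaN | 78.0 | 81.0 | 80.0 |
| max | NaN | NaN | NaN | NaN | NaN | 100.0 | 100.0 | 100.0 |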


- We can see the above table includes a summary of both numerical and categorical columns. A categorical column cannot have minimum or maximum values or mean and median. So the entries in these fields for categorical columns are NaN.
- We have seen in a previous module that this dataset has 5 categorical variables (gender, race/ethnicity, parental level of education, lunch and test preparation course) and 3 numerical (i.e., integer) columns (math score, reading score, and writing score).

## Let's see what information about the data is returned in the above table:

- count - the count of non-null entries in the column. For example, the gender column has 1000 non-null entries.
- unique - the count of unique values in a column. Only for categorical columns. For example, the gender column has 2 unique values - male and female.
- top - only for categorical columns. This tells us which category occurs most often. For example, in the gender column 'female' occurs most often.
- freq - again for categorical columns only. This tells you the number of occurrences of the top category in that column. For example, 'female' occurs 502 times in the gender column.
- mean - the mean value of the numerical column. For example, the mean math score is 67.128.
- std - the standard deviation of the numerical column. It measures how spread out the values are around the mean.
- min - the minimum value in the numerical column. For example, the minimum math score is 15.0.
- 25% - the 25th percentile (or 1st quartile) value in the numerical column. For example, the 25th percentile value for math score is 58.0.
- 50% - the 50th percentile (or 2nd quartile or the median) value in the numerical column. For example, the median math score is 67.0.
- 75% - the 75th percentile (or 3rd quartile) value in the numerical column. For example, the 75th percentile value for math score is 78.0.
- max - the maximum value in the numerical column. For example, the maximum math score is 100.0.
- NaN means that a particular summary value is not available for a particular column. For example, gender (a categorical column) does not have a mean or median value, as these are properties of numerical columns only.
### We can also use describe() method on a particular column/series:

In [41]:
# getting summary of a particular column (math score)
exam_scores["math score"].describe()

count    1000.000000
mean       67.128000
std        14.815367
min        15.000000
25%        58.000000
50%        67.000000
75%        78.000000
max       100.000000
Name: math score, dtype: float64

In [42]:
# getting summary of a particular column (gender)
exam_scores["gender"].describe()

count       1000
unique         2
top       female
freq         502
Name: gender, dtype: object

- If the column name is a valid Python identifier (for example, it contains no spaces), you can use the attribute style of selecting a column. That is why attribute style works for 'gender' but not for 'math score', which contains a space.
- If you use the attribute style of selection for 'math score', Python will raise a SyntaxError.


In [43]:
# for example
exam_scores.math score

SyntaxError: invalid syntax (<ipython-input-43-061379d3392c>, line 2)

# Aggregation Functions
- We saw the use of describe() method on a DataFrame or a series which returned some information (i.e. the summary) about the data.
- We can also use the individual methods like mean(), median(), unique() to get this information on a DataFrame or a series.
- Examples are shown below:

In [45]:
# Mean of the three numerical columns
exam_scores.mean(numeric_only=True)   # numeric_only=True skips the text columns (needed in pandas 2.0+)

math score       67.128
reading score    70.174
writing score    68.973
dtype: float64

In [46]:
exam_scores.gender.unique()   # returns all unique values in the column "gender"

array(['male', 'female'], dtype=object)

- To see all the unique values and the number of times they are occurring in the dataset, we have a method called value_counts():

In [47]:
exam_scores.gender.value_counts()

female    502
male      498
Name: gender, dtype: int64

- 'female' occurs 502 times in the gender column, which matches the freq value returned by the describe() method.
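The individual methods can be tried on a small stand-in for the exam data. The values below are invented; `nunique()` and `idxmax()` are additional standard Pandas methods that this tutorial does not otherwise cover.

```python
import pandas as pd

# Small stand-in for the exam data; values are invented
df = pd.DataFrame({"gender": ["male", "female", "female", "male", "female"],
                   "math score": [74, 58, 66, 90, 82]})

scores = df["math score"]
print(scores.mean())                 # arithmetic mean
print(scores.median())               # middle value
print(scores.min(), scores.max())    # smallest and largest value

counts = df["gender"].value_counts()
print(df["gender"].nunique())        # number of distinct categories
print(counts.idxmax())               # most frequent category
```

These correspond to the mean, 50%, min, max, unique and top rows of the describe() output.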

# Sorting
- Assume you are a teacher and you have the scores of your students in the dataset 'exam_scores'. You want to find the students with the lowest scores in math.

- You could go through all the records in the dataset and pick out those students manually.

- That is feasible with 10 or 20 records, but here you have 1000. You need an easier way, and this is where sorting comes in.

- The rows of the DataFrame 'exam_scores' are ordered by the default index, not by value.

- Pandas provides a method called sort_values() which returns the rows sorted by value.

- Let's say we want the students' records in increasing order of math score.

In [48]:
exam_scores.sort_values(by = "math score").head()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 949 | male | group C | high school | free/reduced | none | 15 | 18 | 10 |
| 854 | male | group C | high school | free/reduced | none | 18 | 30 | 18 |
| 891 | female | group C | high school | free/reduced | none | 23 | 31 | 27 |
| 349 | male | group B | some high school | free/reduced | none | 25 | 31 | 30 |
| 739 | female | group E | some high school | free/reduced | none | 25 | 35 | 38 |


- By default, sorting happens in ascending order.

- We can get the result in descending (decreasing) order by passing ascending=False to the sort_values() method:

In [49]:
exam_scores.sort_values(by = "math score", ascending= False).head()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 91 | male | group D | associate's degree | standard | none | 100 | 94 | 96 |
| 776 | male | group E | associate's degree | standard | completed | 100 | 100 | 100 |
| 588 | male | group D | some college | standard | completed | 100 | 85 | 91 |
| 128 | male | group A | associate's degree | standard | completed | 100 | 97 | 94 |
| 565 | male | group E | bachelor's degree | standard | completed | 100 | 100 | 100 |


- We can also sort a series using the sort_values() method:


In [50]:
exam_scores["math score"].sort_values(ascending = False)[:5]

91     100
776    100
588    100
128    100
565    100
Name: math score, dtype: int64
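When several rows share the same value, as with the five students who scored 100 above, a second sort column can break the ties. This is a minimal sketch on made-up data; passing lists to `by` and `ascending` is a standard feature of sort_values().

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C", "D"],
                   "math score": [70, 55, 70, 90],
                   "reading score": [60, 80, 75, 85]})

# Sort by math score (descending); break ties with reading score (ascending)
ranked = df.sort_values(by=["math score", "reading score"],
                        ascending=[False, True])
print(ranked)

# sort_values() returns a new DataFrame; df itself keeps its original order
print(df["name"].tolist())
```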

# Renaming
Renaming means changing the name of a column or an index label.

- Datasets often arrive with column names that are inconvenient to work with.

- For example, in this dataset 'math score', 'reading score' and 'writing score' contain spaces, so we cannot use attribute (dot) style selection on them.

- Uninformative names are another reason to rename: column names such as 'ms', 'rs' and 'ws' would not tell a reader that they mean math score, reading score and writing score.

- Pandas provides the rename() method to rename columns or index labels in a DataFrame.

- Let's rename all the columns in our DataFrame whose names contain spaces. We will also rename the column 'race/ethnicity' to 'race'.



In [52]:
exam_scores.rename(columns={
    "race/ethnicity": "race",
    "parental level of education": "parent_education_level",
    "test preparation course": "test_prep_course",
    "math score": "math_score",
    "reading score": "reading_score",
    "writing score": "writing_score"},
    inplace=True)
# keys must match the existing column names exactly; keys that match nothing are silently ignored
# inplace=True modifies exam_scores itself instead of returning a renamed copy

- We can get the list of all the columns using the attribute columns.

In [53]:
exam_scores.columns


Index(['gender', 'race', 'parent_education_level', 'lunch',
       'test_prep_course', 'math_score', 'reading_score', 'writing_score'],
      dtype='object')

- We can also rename the index labels of the DataFrame using the rename() method:

In [54]:
exam_scores.rename(index = {0:"zero", 1:"one"})

|   | gender | race | parent_education_level | lunch | test_prep_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| zero | male | group B | bachelor's degree | standard | none | 74 | 68 | 67 |
| one | female | group C | some college | standard | completed | 58 | 68 | 66 |
| 2 | male | group C | some college | free/reduced | none | 66 | 65 | 65 |
| 3 | female | group D | bachelor's degree | free/reduced | none | 74 | 75 | 73 |
| 4 | male | group D | some college | standard | none | 78 | 77 | 71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group C | some high school | standard | none | 68 | 77 | 72 |
| 996 | female | group E | some college | standard | none | 98 | 81 | 94 |
| 997 | female | group E | associate's degree | free/reduced | none | 67 | 67 | 67 |
| 998 | female | group C | high school | standard | none | 63 | 68 | 70 |
| 999 | male | group C | some college | free/reduced | none | 49 | 57 | 50 |
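Typing every old and new name by hand is error-prone, since a single mistyped key is skipped without any warning. As a possible alternative, rename() also accepts a function that is applied to each column name. The sketch below uses a made-up one-row DataFrame and a lambda that replaces spaces with underscores.

```python
import pandas as pd

df = pd.DataFrame({"math score": [74], "reading score": [68], "gender": ["male"]})

# Pass a function to rename(): it is called once per column name
cleaned = df.rename(columns=lambda name: name.replace(" ", "_"))
print(list(cleaned.columns))

# Without inplace=True the original DataFrame is left unchanged
print(list(df.columns))
```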


# Checking and Filling Missing Data
## What is a Missing Value?
- If in any row or column in a DataFrame, a value is not available, it is said to be a missing value.
- So, defining missing data: Missing data (or missing values) is defined as the data values that are not stored in a column or row.
- Pandas provides isnull(), isna() functions to detect missing values. Both of them do the same thing.

- df.isna() or df.isnull() returns the DataFrame with Boolean values indicating whether a value is missing (True) or not (False).
- We can get column wise count of all the missing values using the aggregation function sum():


In [55]:
# df.isnull().sum()

- Pandas also provides fillna() method to fill the missing values. fillna() provides many different strategies to fill missing values.

- Let's say we want to fill the missing values in 'Names' column with 'unknown'.

In [56]:
# df.Names.fillna("unknown")
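Since the cells above are commented out, here is a runnable sketch of detecting and filling missing values. The DataFrame is invented for the example; the 'Names' column mirrors the one used in the text.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with two missing entries (None and np.nan)
df = pd.DataFrame({"Names": ["Leorio", None, "Zenitsu"],
                   "Marks": [86.29, 91.63, np.nan]})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# fillna() returns a new Series; assign it back to keep the result
df["Names"] = df["Names"].fillna("unknown")
print(df["Names"].tolist())
```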

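fillna() returns a new Series and leaves the original DataFrame unchanged. To keep the result, assign it back to the column. Passing inplace = True to fillna() on a selected column is not reliable in recent versions of Pandas, because the selected column may be a copy rather than the original data.

In [57]:
# df["Names"] = df["Names"].fillna("unknown")
# df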