## Outline
- DataFrames
    - sorting & subsetting
    - creating new columns
- Aggregating Data
    - summary statistics
    - counting
    - grouped summary statistics
- Slicing & indexing data
    - subsetting using slicing
    - indexes & subsetting using indexes
- Creating & Visualizing Data
    - plotting
    - handling missing values
    - reading data into DataFrame

- Pandas is built on NumPy and Matplotlib

## DataFrames

In [214]:
import numpy as np
import pandas as pd

In [243]:
dogs = pd.read_csv('dataset/dogs.csv')
titanic = pd.read_csv('dataset/titanic.csv')
dogs.head()
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [216]:
dogs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           7 non-null      object
 1   Breed          7 non-null      object
 2   Color          7 non-null      object
 3   Height         7 non-null      int64 
 4   Weight         7 non-null      int64 
 5   Date of Birth  7 non-null      object
dtypes: int64(2), object(4)
memory usage: 464.0+ bytes


In [217]:
dogs.shape

(7, 6)

In [218]:
dogs.describe()

Unnamed: 0,Height,Weight
count,7.0,7.0
mean,49.714286,27.428571
std,17.960274,22.292429
min,18.0,2.0
25%,44.5,19.5
50%,49.0,23.0
75%,57.5,27.0
max,77.0,74.0


In [219]:
dogs.values

array([['Bella', ' Labrador', ' Brown', 56, 25, ' 2013-07-01'],
       ['Charlie', ' Poodle', ' Black', 43, 23, ' 2016-09-16'],
       ['Lucy', ' Chow', ' Brown', 46, 22, ' 2014-08-25'],
       ['Cooper', ' Schnauzer', ' Gray', 49, 17, ' 2011-12-11'],
       ['Max', ' Labrador', ' Black', 59, 29, ' 2017-01-20'],
       ['Stella', ' Chihuahua', ' Tan', 18, 2, ' 2015-04-20'],
       ['Bernie', ' St. Bernard', ' White', 77, 74, ' 2018-02-2']],
      dtype=object)
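
The leading spaces visible in the array above (for example ``' Labrador'``) suggest that the CSV has a space after each comma in its data rows. A minimal sketch of one way to avoid them, assuming a file laid out like the inline sample below (the sample text is illustrative, not the real ``dogs.csv``):

```python
import io

import pandas as pd

# Illustrative rows: clean header, but a space after each comma in the data.
raw = "Name,Breed,Color\nBella, Labrador, Brown\nCharlie, Poodle, Black\n"

# Default parsing keeps the space as part of each value.
padded = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that follows the delimiter.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(padded["Color"].tolist())
print(clean["Color"].tolist())
```

With values read this way, a filter can compare against ``'Black'`` rather than ``' Black'``.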

In [220]:
dogs.columns

Index(['Name', 'Breed', 'Color', 'Height', 'Weight', 'Date of Birth'], dtype='object')

In [221]:
dogs.index

RangeIndex(start=0, stop=7, step=1)

##### 1- sorting &  subsetting

###### sorting

In [222]:
dogs.sort_values(by='Name')

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
6,Bernie,St. Bernard,White,77,74,2018-02-2
1,Charlie,Poodle,Black,43,23,2016-09-16
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
2,Lucy,Chow,Brown,46,22,2014-08-25
4,Max,Labrador,Black,59,29,2017-01-20
5,Stella,Chihuahua,Tan,18,2,2015-04-20


In [223]:
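# Note: ascending='False' is a non-empty string, not the boolean False, so the
# output below is in ascending order (newer pandas versions may reject a string here).
# Pass ascending=False for a descending sort.
dogs.sort_values(by='Breed', ascending='False', na_position='first')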

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
5,Stella,Chihuahua,Tan,18,2,2015-04-20
2,Lucy,Chow,Brown,46,22,2014-08-25
0,Bella,Labrador,Brown,56,25,2013-07-01
4,Max,Labrador,Black,59,29,2017-01-20
1,Charlie,Poodle,Black,43,23,2016-09-16
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
6,Bernie,St. Bernard,White,77,74,2018-02-2
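
For a descending sort, ``ascending`` needs the boolean ``False`` rather than a string. A small sketch on a hypothetical four-row frame (the rows are chosen for illustration and have no leading spaces):

```python
import pandas as pd

# Hypothetical subset of the dogs data, used only for this illustration.
dogs_small = pd.DataFrame({
    "Name": ["Bella", "Charlie", "Lucy", "Stella"],
    "Breed": ["Labrador", "Poodle", "Chow", "Chihuahua"],
})

# ascending takes a boolean (or a list of booleans, one per sort column).
by_breed_desc = dogs_small.sort_values(by="Breed", ascending=False)

print(by_breed_desc["Breed"].tolist())
# → ['Poodle', 'Labrador', 'Chow', 'Chihuahua']
```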


In [224]:
dogs.sort_values('Name')

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
6,Bernie,St. Bernard,White,77,74,2018-02-2
1,Charlie,Poodle,Black,43,23,2016-09-16
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
2,Lucy,Chow,Brown,46,22,2014-08-25
4,Max,Labrador,Black,59,29,2017-01-20
5,Stella,Chihuahua,Tan,18,2,2015-04-20


In [225]:
dogs.sort_values(['Breed','Color'], ascending=[True,False])

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
5,Stella,Chihuahua,Tan,18,2,2015-04-20
2,Lucy,Chow,Brown,46,22,2014-08-25
0,Bella,Labrador,Brown,56,25,2013-07-01
4,Max,Labrador,Black,59,29,2017-01-20
1,Charlie,Poodle,Black,43,23,2016-09-16
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
6,Bernie,St. Bernard,White,77,74,2018-02-2


###### Subsetting

In [226]:
# dogs[['Name', 'Survived']]
dogs_info = ['Name', 'Breed']
dogs[dogs_info]

Unnamed: 0,Name,Breed
0,Bella,Labrador
1,Charlie,Poodle
2,Lucy,Chow
3,Cooper,Schnauzer
4,Max,Labrador
5,Stella,Chihuahua
6,Bernie,St. Bernard


In [234]:
dogs

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Chow,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
4,Max,Labrador,Black,59,29,2017-01-20
5,Stella,Chihuahua,Tan,18,2,2015-04-20
6,Bernie,St. Bernard,White,77,74,2018-02-2


In [227]:
print(dogs['Weight'] > 20.0)

0     True
1     True
2     True
3    False
4     True
5    False
6     True
Name: Weight, dtype: bool


In [228]:
dogs['Height']> 30

0     True
1     True
2     True
3     True
4     True
5    False
6     True
Name: Height, dtype: bool

In [238]:
# dogs[dogs['Sex'] == 'male']
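# Subsetting on a date column works the same way.
# The boolean mask can also be stored in a variable first.
# Note the leading space in ' Black': the values read from dogs.csv keep it.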
variable = dogs['Color'] == ' Black'
dogs[variable]

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
1,Charlie,Poodle,Black,43,23,2016-09-16
4,Max,Labrador,Black,59,29,2017-01-20


In [241]:
dog_black = dogs['Color'] == ' Black'
dog_poodle = dogs['Breed'] == ' Poodle'
dogs[dog_black & dog_poodle]

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
1,Charlie,Poodle,Black,43,23,2016-09-16
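
Boolean masks can also be combined inline with ``&`` (and), ``|`` (or) and ``~`` (not). A short sketch, assuming a hypothetical frame with whitespace-free values:

```python
import pandas as pd

# Hypothetical subset of the dogs data, used only for this illustration.
dogs_small = pd.DataFrame({
    "Name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "Color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "Height": [56, 43, 46, 49, 59],
})

# Each comparison needs its own parentheses because & and | bind
# more tightly than == and >.
black_and_tall = dogs_small[(dogs_small["Color"] == "Black") & (dogs_small["Height"] > 50)]
brown_or_gray = dogs_small[(dogs_small["Color"] == "Brown") | (dogs_small["Color"] == "Gray")]
not_black = dogs_small[~(dogs_small["Color"] == "Black")]

print(black_and_tall["Name"].tolist())  # → ['Max']
```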


In [244]:
dogs[dogs['Color'].isin([' Tan',' Gray'])]

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
5,Stella,Chihuahua,Tan,18,2,2015-04-20


#### 2- Creating New Columns

In [246]:
dogs['dog_bmi'] = dogs['Weight']/dogs['Height']**2
dogs

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth,bmi,dog_bmi
0,Bella,Labrador,Brown,56,25,2013-07-01,0.007972,0.007972
1,Charlie,Poodle,Black,43,23,2016-09-16,0.012439,0.012439
2,Lucy,Chow,Brown,46,22,2014-08-25,0.010397,0.010397
3,Cooper,Schnauzer,Gray,49,17,2011-12-11,0.00708,0.00708
4,Max,Labrador,Black,59,29,2017-01-20,0.008331,0.008331
5,Stella,Chihuahua,Tan,18,2,2015-04-20,0.006173,0.006173
6,Bernie,St. Bernard,White,77,74,2018-02-2,0.012481,0.012481


In [247]:
Height_lt_50 = dogs[dogs.Height < 50]
descending_order = Height_lt_50.sort_values('Height', ascending=False)
descending_order[['Name', 'Breed', 'Height']]

Unnamed: 0,Name,Breed,Height
3,Cooper,Schnauzer,49
2,Lucy,Chow,46
1,Charlie,Poodle,43
5,Stella,Chihuahua,18


## Aggregating Data

#### Summary statistics

### Summarizing numerical data
- .mean()
- .median()
- .min()
- .max()
- .var()
- .std()
- .sum()
- .quantile()

In [248]:
dogs['Height'].mean()

49.714285714285715

In [249]:
dogs['Height'].mode() #<-- every height is unique, so all seven values are returned

0    18
1    43
2    46
3    49
4    56
5    59
6    77
dtype: int64

In [250]:
dogs.Weight.min()

2

In [251]:
dogs.Weight.max()

74

In [252]:
dogs['Height'].var() #<--Return unbiased variance over requested axis.

322.5714285714286

In [253]:
dogs['Height'].quantile() #<--Return the value at the given quantile; the default q=0.5 is the median.

49.0

In [254]:
dogs['Weight'].std()

22.29242878091979

In [255]:
dogs['Weight'].sum()

192

### summarizing dates
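
The same summary methods work on dates once the column holds real datetimes instead of strings. A minimal sketch, assuming date strings shaped like the ``Date of Birth`` values (with a leading space) and using only well-formed dates:

```python
import pandas as pd

# Illustrative strings in the same shape as the 'Date of Birth' column.
dob = pd.Series([" 2013-07-01", " 2016-09-16", " 2011-12-11"], name="Date of Birth")

# Strip the padding, then parse the strings into datetime64 values.
dates = pd.to_datetime(dob.str.strip())

oldest = dates.min()    # earliest date of birth
youngest = dates.max()  # latest date of birth

print(oldest.date(), youngest.date())
# → 2011-12-11 2016-09-16
```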

## .agg() method

- ``.agg()`` applies one or more operations to a single column or to multiple columns.
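
Besides custom functions, ``.agg()`` also accepts the names of built-in summaries. A brief sketch on a hypothetical three-row frame, showing a list of names and a per-column dict:

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration.
dogs_small = pd.DataFrame({"Height": [56, 43, 46], "Weight": [25, 23, 22]})

# A list applies every function to each selected column.
both = dogs_small[["Height", "Weight"]].agg(["min", "max"])

# A dict picks a different function for each column.
per_column = dogs_small.agg({"Height": "mean", "Weight": "sum"})

print(both)
print(per_column)
```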

In [256]:
dogs.head()

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth,bmi,dog_bmi
0,Bella,Labrador,Brown,56,25,2013-07-01,0.007972,0.007972
1,Charlie,Poodle,Black,43,23,2016-09-16,0.012439,0.012439
2,Lucy,Chow,Brown,46,22,2014-08-25,0.010397,0.010397
3,Cooper,Schnauzer,Gray,49,17,2011-12-11,0.00708,0.00708
4,Max,Labrador,Black,59,29,2017-01-20,0.008331,0.008331


##### on Single column

In [257]:
def pct30(column):
    return column.quantile(0.3)

In [258]:
dogs['Weight'].agg(pct30)#<-- applying agg() on a column using simple function

21.0

In [259]:
dogs['Weight'].agg(lambda col : col.quantile(.3)) #<-- using lambda function

21.0

##### on multiple column

In [260]:
dogs[['Weight', 'Height']].agg(lambda x: x.quantile(0.3))

Weight    21.0
Height    45.4
dtype: float64

##### multiple summaries

In [262]:
#def pct30(column): return column.quantile(0.3)
def pct40(column): return column.quantile(0.4)

In [263]:
dogs['Weight'].agg([pct30,pct40])

pct30    21.0
pct40    22.4
Name: Weight, dtype: float64

### cumulative statistics
- .cumsum()
- .cummax()
- .cummin()
- .cumprod()

In [264]:
pd.DataFrame(dogs['Weight'].cumsum()).head(4)

Unnamed: 0,Weight
0,25
1,48
2,70
3,87
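
The other cumulative methods follow the same pattern. A small sketch using the first five weights from the table above:

```python
import pandas as pd

# First five weights from the dogs data.
weights = pd.Series([25, 23, 22, 17, 29], name="Weight")

running_total = weights.cumsum()  # 25, 48, 70, 87, 116
running_max = weights.cummax()    # 25, 25, 25, 25, 29
running_min = weights.cummin()    # 25, 23, 22, 17, 17

print(pd.DataFrame({"cumsum": running_total, "cummax": running_max, "cummin": running_min}))
```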


## Counting

- So far, you've learned how to ``summarize numeric variables``. In this section, you'll learn how to ``summarize categorical data`` using counting.

- Categorical variables represent types of **data which may be divided into groups**. Examples of categorical variables are race, sex, age group, and education level.

#### Dropping duplicate names

In [21]:
dogs.drop_duplicates(subset = "Breed")

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth,bmi,dog_bmi
0,Bella,Labrador,Brown,56,25,2013-07-01,0.007972,0.007972
1,Charlie,Poodle,Black,43,23,2016-09-16,0.012439,0.012439
2,Lucy,Chow,Brown,46,22,2014-08-25,0.010397,0.010397
3,Cooper,Schnauzer,Gray,49,17,2011-12-11,0.00708,0.00708
5,Stella,Chihuahua,Tan,18,2,2015-04-20,0.006173,0.006173
6,Bernie,St. Bernard,White,77,74,2018-02-2,0.012481,0.012481


In [266]:
dogs.drop_duplicates(subset = ["Breed", 'Color'])

Unnamed: 0,Name,Breed,Color,Height,Weight,Date of Birth,bmi,dog_bmi
0,Bella,Labrador,Brown,56,25,2013-07-01,0.007972,0.007972
1,Charlie,Poodle,Black,43,23,2016-09-16,0.012439,0.012439
2,Lucy,Chow,Brown,46,22,2014-08-25,0.010397,0.010397
3,Cooper,Schnauzer,Gray,49,17,2011-12-11,0.00708,0.00708
4,Max,Labrador,Black,59,29,2017-01-20,0.008331,0.008331
5,Stella,Chihuahua,Tan,18,2,2015-04-20,0.006173,0.006173
6,Bernie,St. Bernard,White,77,74,2018-02-2,0.012481,0.012481


#### .value_counts()

In [267]:
pd.DataFrame(dogs['Weight'].value_counts()) # sort=True by default

Unnamed: 0,Weight
25,1
23,1
22,1
17,1
29,1
2,1
74,1


In [268]:
pd.DataFrame(dogs['Weight'].value_counts(sort=False))

Unnamed: 0,Weight
25,1
23,1
22,1
17,1
29,1
2,1
74,1


- The ``normalize=True`` argument turns the counts into proportions of the total (for example, 0.25 for 25%).

In [269]:
pd.DataFrame(dogs['Weight'].value_counts(normalize=True))

Unnamed: 0,Weight
25,0.142857
23,0.142857
22,0.142857
17,0.142857
29,0.142857
2,0.142857
74,0.142857
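
Every weight above is unique, so each count is 1. Counting is more informative on a column with repeated categories. A short sketch using the seven breed names, typed inline without the leading spaces:

```python
import pandas as pd

# The seven breeds from the dogs data, typed inline for this illustration.
breeds = pd.Series(
    ["Labrador", "Poodle", "Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"],
    name="Breed",
)

counts = breeds.value_counts()                # most frequent category first
shares = breeds.value_counts(normalize=True)  # proportions that sum to 1

print(counts.index[0], counts.iloc[0])
# → Labrador 2
```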


## Grouped summary statistics

In [270]:
dogs[dogs['Color'] == ' Black']['Height'].mean()

51.0

In [271]:
dogs.groupby('Weight')['Height'].mean()

Weight
2     18.0
17    49.0
22    46.0
23    43.0
25    56.0
29    59.0
74    77.0
Name: Height, dtype: float64
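
Grouping by a categorical column such as ``Color`` computes in one call what the manual filter does for one color at a time. A sketch with the colors and heights typed inline (no leading spaces):

```python
import pandas as pd

# Colors and heights from the dogs data, typed inline for this illustration.
dogs_small = pd.DataFrame({
    "Color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "Height": [56, 43, 46, 49, 59, 18, 77],
})

# Manual approach: filter one color, then summarize.
black_mean = dogs_small[dogs_small["Color"] == "Black"]["Height"].mean()

# groupby: one mean per color, indexed by the color name.
by_color = dogs_small.groupby("Color")["Height"].mean()

print(black_mean, by_color["Black"])
# → 51.0 51.0
```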

In [272]:
pd.DataFrame(dogs.groupby('Weight')['Height'].mean())

Unnamed: 0_level_0,Height
Weight,Unnamed: 1_level_1
2,18.0
17,49.0
22,46.0
23,43.0
25,56.0
29,59.0
74,77.0


In [273]:
dogs.groupby(['Breed', 'Color'])['Weight'].count() # <-- multiple grouping columns

Breed         Color 
 Chihuahua     Tan      1
 Chow          Brown    1
 Labrador      Black    1
               Brown    1
 Poodle        Black    1
 Schnauzer     Gray     1
 St. Bernard   White    1
Name: Weight, dtype: int64

In [274]:
dogs.groupby('Breed')['Color'].agg(['count', 'min', 'max'])# <-- multiple stats

Unnamed: 0_level_0,count,min,max
Breed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Chihuahua,1,Tan,Tan
Chow,1,Brown,Brown
Labrador,2,Black,Brown
Poodle,1,Black,Black
Schnauzer,1,Gray,Gray
St. Bernard,1,White,White


In [275]:
dogs.groupby(['Breed', 'Color'])[['Weight', 'Height']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Weight,Height
Breed,Color,Unnamed: 2_level_1,Unnamed: 3_level_1
Chihuahua,Tan,2.0,18.0
Chow,Brown,22.0,46.0
Labrador,Black,29.0,59.0
Labrador,Brown,25.0,56.0
Poodle,Black,23.0,43.0
Schnauzer,Gray,17.0,49.0
St. Bernard,White,74.0,77.0


In [276]:
dogs.groupby(['Breed', 'Color'])[['Weight', 'Height']].agg(['count', 'min', 'max'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Weight,Weight,Weight,Height,Height,Height
Unnamed: 0_level_1,Unnamed: 1_level_1,count,min,max,count,min,max
Breed,Color,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Chihuahua,Tan,1,2,2,1,18,18
Chow,Brown,1,22,22,1,46,46
Labrador,Black,1,29,29,1,59,59
Labrador,Brown,1,25,25,1,56,56
Poodle,Black,1,23,23,1,43,43
Schnauzer,Gray,1,17,17,1,49,49
St. Bernard,White,1,74,74,1,77,77


## Pivot tables
**Signature**:
dogs.pivot_table(
    values=None,
    index=None,
    columns=None,
    aggfunc='mean',
    fill_value=None,
    margins=False,
    dropna=True,
    margins_name='All',
    observed=False,
)

In [277]:
dogs.groupby('Weight')['Height'].mean()

Weight
2     18.0
17    49.0
22    46.0
23    43.0
25    56.0
29    59.0
74    77.0
Name: Height, dtype: float64

- The ``values`` argument is the column you want to ``summarize``, and the ``index`` argument is the column you want to ``group by``.
- By default, pivot_table takes the **mean** value for each group.
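
The two points above can be checked by comparing a pivot table with the equivalent ``groupby`` call. A minimal sketch on a hypothetical five-row frame:

```python
import pandas as pd

# Hypothetical subset of the dogs data, used only for this illustration.
dogs_small = pd.DataFrame({
    "Color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "Weight": [25, 23, 22, 17, 29],
})

# values = column to summarize, index = column to group by; mean by default.
by_pivot = dogs_small.pivot_table(values="Weight", index="Color")

# The same numbers from groupby.
by_group = dogs_small.groupby("Color")["Weight"].mean()

print(by_pivot.loc["Black", "Weight"], by_group["Black"])
# → 26.0 26.0
```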

In [281]:
#pivot with the default aggfunc (the mean)
dogs.pivot_table(values = 'Weight', index='Color')

Unnamed: 0_level_0,Weight
Color,Unnamed: 1_level_1
Black,26.0
Brown,23.5
Gray,17.0
Tan,2.0
White,74.0


In [282]:
#explicitly define the statistic, i.e. np.median
dogs.pivot_table(values = 'Weight', index='Color', aggfunc=np.median)

Unnamed: 0_level_0,Weight
Color,Unnamed: 1_level_1
Black,26.0
Brown,23.5
Gray,17.0
Tan,2.0
White,74.0


In [283]:
#multiple statistics
dogs.pivot_table(values = 'Weight', index='Color', aggfunc=[np.std, np.median])

Unnamed: 0_level_0,std,median
Unnamed: 0_level_1,Weight,Weight
Color,Unnamed: 1_level_2,Unnamed: 2_level_2
Black,4.242641,26.0
Brown,2.12132,23.5
Gray,,17.0
Tan,,2.0
White,,74.0
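
``aggfunc`` also accepts the names of the statistics as strings. Newer pandas releases may emit a warning when NumPy functions such as ``np.median`` are passed, so the string form can be the safer choice. A sketch on a hypothetical five-row frame:

```python
import pandas as pd

# Hypothetical subset of the dogs data, used only for this illustration.
dogs_small = pd.DataFrame({
    "Color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "Weight": [25, 23, 22, 17, 29],
})

# String names instead of np.median / np.std.
summary = dogs_small.pivot_table(values="Weight", index="Color", aggfunc=["median", "std"])

# Columns form a two-level index: (statistic, value column).
print(summary.loc["Black", ("median", "Weight")])
# → 26.0
```

A group with a single row, such as Gray here, has no sample standard deviation, which is why ``NaN`` appears in the ``std`` column.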


#### pivot on two variables

- To group by two variables, we can pass a **second variable name into the columns argument**.

In [285]:
#in groupby

#dogs.groupby(['Survived','Sex'])['Age'].mean().unstack()

#pivot on two variables
dogs.pivot_table(values = 'Weight', index='Color', columns='Name')

Name,Bella,Bernie,Charlie,Cooper,Lucy,Max,Stella
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Black,,,23.0,,,29.0,
Brown,25.0,,,,22.0,,
Gray,,,,17.0,,,
Tan,,,,,,,2.0
White,,74.0,,,,,
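
Pivoting on two variables corresponds to grouping by both and unstacking the second level, as the commented-out ``groupby`` line hints. A sketch that uses ``Breed`` for the columns (an assumption made here so that one column holds more than one dog), on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical subset of the dogs data, used only for this illustration.
dogs_small = pd.DataFrame({
    "Breed": ["Labrador", "Poodle", "Chow", "Labrador"],
    "Color": ["Brown", "Black", "Brown", "Black"],
    "Weight": [25, 23, 22, 29],
})

# index = rows, columns = second grouping variable.
pivoted = dogs_small.pivot_table(values="Weight", index="Color", columns="Breed")

# Equivalent: group by both variables, then move Breed into the columns.
unstacked = dogs_small.groupby(["Color", "Breed"])["Weight"].mean().unstack()

print(pivoted)
```

Combinations that never occur, such as a brown Poodle in this sample, come out as ``NaN`` in both results.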


#### filling missing values in pivot table

In [286]:
dogs.pivot_table(values = 'Weight', index='Color', columns='Name', fill_value=0)

Name,Bella,Bernie,Charlie,Cooper,Lucy,Max,Stella
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Black,0,0,23,0,0,29,0
Brown,25,0,0,0,22,0,0
Gray,0,0,0,17,0,0,0
Tan,0,0,0,0,0,0,2
White,0,74,0,0,0,0,0


#### summing with pivot table

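- Setting ``margins=True`` lets us see the summary statistic at several levels at once: for the entire dataset, grouped by one variable, grouped by the other variable, and grouped by both. The ``All`` margins are computed from the data itself with ``aggfunc`` (the mean by default), not from the filled-in zeros.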

In [287]:
dogs.pivot_table(values = 'Weight',
                 index='Color',
                 columns='Name',
                 fill_value=0,
                 margins=True)

Name,Bella,Bernie,Charlie,Cooper,Lucy,Max,Stella,All
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Black,0,0,23,0,0,29,0,26.0
Brown,25,0,0,0,22,0,0,23.5
Gray,0,0,0,17,0,0,0,17.0
Tan,0,0,0,0,0,0,2,2.0
White,0,74,0,0,0,0,0,74.0
All,25,74,23,17,22,29,2,27.428571
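
The ``All`` margins above are means, because ``aggfunc`` defaults to the mean. To get actual totals, ``aggfunc='sum'`` can be passed alongside ``margins=True``. A sketch on a hypothetical four-row frame, with ``Breed`` as the column variable (an assumption made for illustration):

```python
import pandas as pd

# Hypothetical subset of the dogs data, used only for this illustration.
dogs_small = pd.DataFrame({
    "Breed": ["Labrador", "Poodle", "Chow", "Labrador"],
    "Color": ["Brown", "Black", "Brown", "Black"],
    "Weight": [25, 23, 22, 29],
})

totals = dogs_small.pivot_table(
    values="Weight",
    index="Color",
    columns="Breed",
    aggfunc="sum",    # sum instead of the default mean
    fill_value=0,     # show 0 for combinations that do not occur
    margins=True,     # add the 'All' row and column
)

print(totals.loc["All", "All"])
# → 99
```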


### Thanks:)

- Assignment work on Dogs Dataset