# Discussion

## 1. What is a dataframe?

A DataFrame is a two-dimensional labeled data structure in pandas, where each column can have a different data type (e.g., integer, float, string)
and each row represents a record. 
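
As a minimal sketch of that definition, the frame below is built from a dictionary of columns. The names and values are made up for illustration; they are not from any dataset used in these notes.

```python
import pandas as pd

# Hypothetical records: each dictionary key becomes a column,
# and each position across the lists becomes one row (record).
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
    'passing': [False, True, True, True],
})

n_rows, n_cols = df.shape

# dtype.kind is a one-letter code: 'i' for integer, 'b' for boolean, and so on.
column_kinds = {col: df[col].dtype.kind for col in df.columns}
print(df.dtypes)
```

Each column keeps its own data type, while the row labels (0 through 3 here) are shared by every column.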

## 2. What information can we obtain about a dataframe?

.info() can be used to print a concise summary with the number of non-null values, data types, and memory usage

.describe() can be used to generate summary statistics of a DataFrame or Series. By default it includes the following for each numeric column:

    count
    
    mean
    
    std - the standard deviation
    
    min
    
    25% - the 25th percentile
    
    50% - the 50th percentile (the median)
    
    75% - the 75th percentile
    
    max

.describe(include='all') can be used to include non-numeric columns in the summary statistics

**These are attributes, not methods, so they do not require ()**

    .dtypes - the data type of each column

    .shape - the number of rows and columns

    .columns - the column names

    .index - the labels for each row
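
The methods and attributes above can be tried on a small frame. This is only a sketch; the grades data is hypothetical and reuses the column names that appear later in these notes.

```python
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
    'english': [85, 79, 74, 96],
    'reading': [80, 67, 77, 91],
    'classroom': ['A', 'B', 'A', 'B'],
})

df.info()                                # method: prints the summary, needs ()
stats = df.describe()                    # numeric columns only by default
all_stats = df.describe(include='all')   # also covers non-numeric columns

print(df.shape)      # attribute: (rows, columns), no ()
print(df.dtypes)     # attribute: data type per column
print(df.columns)    # attribute: column names
print(df.index)      # attribute: row labels
```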

## 3. How have we interacted with a dataframe?

Subsetting to access multiple columns with a list of strings

    .head() top 5 by default
    .tail() bottom 5 by default
    .sample() 1 by default
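
A short sketch of column subsetting and the three row-preview methods, using hypothetical grades data. The `random_state` argument is added here only so the sampled row is repeatable.

```python
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
    'english': [85, 79, 74, 96],
})

# A list of strings inside the brackets returns a DataFrame with those columns.
subset = df[['name', 'math']]

first_two = df.head(2)                      # first 2 rows instead of the default 5
last_one = df.tail(1)                       # last row
random_row = df.sample(1, random_state=0)   # one randomly chosen row
```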


**Boolean mask/series**
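
A boolean mask is a Series of True/False values that shares the frame's index; passing it inside the brackets keeps only the True rows. The sketch below assumes hypothetical grades data and an illustrative threshold of 70.

```python
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
    'english': [85, 79, 74, 96],
})

mask = df['math'] >= 70      # boolean Series, one value per row
passing = df[mask]           # rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses.
strong_in_both = df[(df['math'] >= 70) & (df['english'] >= 80)]
```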

**Drops columns**

df.drop(columns=['english', 'reading'])


**Renames columns**

df.rename(columns={'name': 'student'})


**Creating columns**

df['passing_math'] = df.math > 70


**Add another column using .assign**

df.assign(passing_english=df.english >= 70)
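
One detail worth checking is which of these calls change `df` itself. In the sketch below (hypothetical data), `drop`, `rename`, and `assign` each return a new DataFrame and leave the original untouched, while bracket assignment modifies the frame in place.

```python
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
    'english': [85, 79, 74, 96],
    'reading': [80, 67, 77, 91],
})

dropped = df.drop(columns=['english', 'reading'])            # new frame
renamed = df.rename(columns={'name': 'student'})             # new frame
with_flag = df.assign(passing_english=df['english'] >= 70)   # new frame

# At this point df still has its original four columns.
columns_before = list(df.columns)

# Bracket assignment adds the column to df itself.
df['passing_math'] = df['math'] > 70
```

To keep the result of `drop`, `rename`, or `assign`, reassign it to a variable.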


**Sorting DF by columns**

df.sort_values(by='english', ascending=False)  # ascending is True by default


**Method chaining**

df[df.english > 90].sort_values(by='english').head(1).name
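
The chain above can be read one step at a time. This sketch uses hypothetical data and a lower threshold (75) so that the sort has more than one row to order.

```python
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'english': [85, 79, 74, 96],
})

# Wrapping the chain in parentheses allows one method per line.
lowest_above_75 = (
    df[df['english'] > 75]       # 1. keep rows with english above 75
    .sort_values(by='english')   # 2. order them from lowest to highest
    .head(1)                     # 3. keep the first row
    ['name']                     # 4. select the name column
)
```

Each step returns a new object, so the next method is applied to the result of the previous one.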


**loc and iloc**

*loc*
Select all the rows and a subset of columns; notice the inclusive behavior of the indexing.

    df.loc[:, 'math':'reading']

*iloc*
Select by integer position instead of by label.
Notice the exclusive behavior of the indexing: the example below returns the first three rows (positions 0, 1, and 2).

    df.iloc[:3]
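
The inclusive versus exclusive behavior can be compared side by side. The sketch assumes a hypothetical frame whose columns are ordered name, math, english, reading, classroom, with the default 0-based row labels.

```python
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
    'english': [85, 79, 74, 96],
    'reading': [80, 67, 77, 91],
    'classroom': ['A', 'B', 'A', 'B'],
})

# loc slices by label and includes the end label 'reading'.
by_label = df.loc[:, 'math':'reading']

# iloc slices by position and excludes the end position 4.
by_position = df.iloc[:, 1:4]

# The same difference applies to rows when the labels are 0, 1, 2, ...
rows_iloc = df.iloc[:3]   # positions 0, 1, 2
rows_loc = df.loc[:3]     # labels 0, 1, 2, 3
```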


**Aggregating for summary stats**

    df[['english', 'reading', 'math']].agg(['mean', 'min', 'max'])


**Group By for another way to summarize**

The highest math grade from each classroom:

    df.groupby('classroom').math.max()
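
A sketch of both summaries on hypothetical data: `agg` produces one row per statistic, and `groupby` produces one value per group.

```python
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
    'english': [85, 79, 74, 96],
    'reading': [80, 67, 77, 91],
    'classroom': ['A', 'B', 'A', 'B'],
})

# Rows of the result are the statistics; columns are the selected subjects.
summary = df[['english', 'reading', 'math']].agg(['mean', 'min', 'max'])

# One maximum per classroom, indexed by the classroom label.
max_math = df.groupby('classroom')['math'].max()
```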


**np.where**

Create the new column based on an existing column.

    df['passing_math'] = np.where(df.math < 70, 'failing', 'passing')
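
`np.where` requires NumPy to be imported. The sketch below shows the full setup on hypothetical data; the column name `math_status` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical grades data for illustration.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal', 'Dee'],
    'math': [62, 88, 79, 93],
})

# np.where(condition, value_if_true, value_if_false), evaluated row by row.
df['math_status'] = np.where(df['math'] < 70, 'failing', 'passing')
```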


**Concat**

    pd.concat([df1, df2], axis=0)  # axis=0 is the default; this puts the second df under the first
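
A sketch of stacking two hypothetical frames. Note that the original row labels are kept unless `ignore_index=True` is passed.

```python
import pandas as pd

# Two hypothetical frames with the same columns.
df1 = pd.DataFrame({'name': ['Ada', 'Ben'], 'math': [62, 88]})
df2 = pd.DataFrame({'name': ['Cal', 'Dee'], 'math': [79, 93]})

stacked = pd.concat([df1, df2], axis=0)             # rows of df2 under df1
renumbered = pd.concat([df1, df2], ignore_index=True)  # fresh 0..n-1 labels
side_by_side = pd.concat([df1, df2], axis=1)        # columns placed next to each other
```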


**Merge** 

df.merge combines two DataFrames on shared keys, very similar to JOIN in SQL. The default settings for commonly used parameters are shown in the call that closes this section.

how: the type of merge to be performed.

how=left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

how=right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

how=outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

how=inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

    left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, indicator=False)
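
To see how the `how` options differ, the sketch below merges two small hypothetical tables. The table names, the `classroom_id` key, and the teacher names are all made up for illustration.

```python
import pandas as pd

# Hypothetical tables: student Cal has no matching classroom,
# and classroom 4 has no matching student.
students = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal'],
    'classroom_id': [1, 2, 3],
})
classrooms = pd.DataFrame({
    'classroom_id': [1, 2, 4],
    'teacher': ['Ng', 'Ortiz', 'Patel'],
})

# inner: only keys present in both frames.
inner = students.merge(classrooms, how='inner', on='classroom_id')

# left: every student is kept; the missing teacher becomes NaN.
left = students.merge(classrooms, how='left', on='classroom_id')

# outer: union of keys; indicator=True adds a _merge column
# recording which frame each row came from.
outer = students.merge(classrooms, how='outer', on='classroom_id', indicator=True)
```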


# Practice Exercises - July 31st

In [2]:
import pandas as pd

df = pd.read_csv('https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv')

#### Information about a dataframe

1. Obtain the following information:

In [180]:
# dimensions ATTRIBUTE
df.shape

(800, 13)

In [181]:
# dtypes ATTRIBUTE
df.dtypes

#              int64
Name          object
Type 1        object
Type 2        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

In [182]:
# column names ATTRIBUTE
df.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

In [7]:
# summary statistics
df.describe()

| | # | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|
| count | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 |
| mean | 362.81375 | 435.1025 | 69.25875 | 79.00125 | 73.8425 | 72.82 | 71.9025 | 68.2775 | 3.32375 |
| std | 208.343798 | 119.96304 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 1.66129 |
| min | 1.0 | 180.0 | 1.0 | 5.0 | 5.0 | 10.0 | 20.0 | 5.0 | 1.0 |
| 25% | 184.75 | 330.0 | 50.0 | 55.0 | 50.0 | 49.75 | 50.0 | 45.0 | 2.0 |
| 50% | 364.5 | 450.0 | 65.0 | 75.0 | 70.0 | 65.0 | 70.0 | 65.0 | 3.0 |
| 75% | 539.25 | 515.0 | 80.0 | 100.0 | 90.0 | 95.0 | 90.0 | 90.0 | 5.0 |
| max | 721.0 | 780.0 | 255.0 | 190.0 | 230.0 | 194.0 | 230.0 | 180.0 | 6.0 |


In [8]:
# concise summary: non-null counts, dtypes, memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


#### Working with dataframes

In [184]:
# 1. What is the highest HP value present?
    # Create a function named highest_attack.
    # Use the loaded dataframe as an argument.
    


In [9]:
def highest_hp(df):
    return df['HP'].max()

highest_hp(df)

255

In [10]:
def highest_attack(df):
    return df['Attack'].max()

highest_attack(df)

190

In [186]:
# 2. Which Pokemon possess(es) the highest HP value?

# df[df.HP == df.HP.max()] using dot notation

df[df['HP'] == df['HP'].max()] # using bracket notation

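| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261 | 242 | Blissey | Normal | NaN | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 2 | False |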


In [13]:
# Look up the row by the index label of the maximum HP instead of hard-coding 261
df.loc[df['HP'].idxmax()]

#                 242
Name          Blissey
Type 1         Normal
Type 2            NaN
Total             540
HP                255
Attack             10
Defense            10
Sp. Atk            75
Sp. Def           135
Speed              55
Generation          2
Legendary       False
Name: 261, dtype: object

In [14]:
# 3. How many different types are represented in Type 1?

# Create a function named num_types
# Use the loaded dataframe as an argument

def num_types(df):
    return df['Type 1'].nunique()

num_types(df)

18

In [23]:
# 4. Number of Pokemon whose Type 2 is Ghost

t2ghosts = df[df['Type 2'] == 'Ghost']

t2ghosts = len(t2ghosts)

t2ghosts

14

In [26]:
# 5. Percentage of Pokemon whose Type 2 is Ghost

    # Create a function named percent_ghost
    # Use the loaded dataframe as an argument
    
def percent_ghost(df):
    num_ghost = len(df[df['Type 2'] == 'Ghost'])
    num_total = len(df)
    percent = num_ghost / num_total * 100
    return (f'{percent:.2f}')

percent_ghost(df)

'1.75'
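
As an alternative sketch, the mean of a boolean mask gives the same proportion directly, because True counts as 1 and False as 0. The small frame below is a hypothetical stand-in for the Pokemon data, and this version returns a float rather than a formatted string.

```python
import pandas as pd

# Hypothetical stand-in: four rows, one of which has Type 2 == 'Ghost'.
sample = pd.DataFrame({'Type 2': ['Ghost', 'Poison', None, 'Flying']})

def percent_ghost(df):
    # Missing Type 2 values compare as False, so they count as non-Ghost.
    return (df['Type 2'] == 'Ghost').mean() * 100

result = percent_ghost(sample)
```

Formatting with `f'{result:.2f}'` can still be applied afterwards when a string is needed for display.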

In [27]:
# 6. Number of Pokemon whose Attack is greater than Defense

len(df[df['Attack'] > df['Defense']])

433

In [193]:
# 7. Lowest speed for Grass or Rock

grass_or_rock = df['Type 1'].isin(['Grass', 'Rock']) | df['Type 2'].isin(['Grass', 'Rock'])

# Take the minimum within the Grass/Rock subset, not across the whole dataframe
lowest_speed = df.loc[grass_or_rock, 'Speed'].min()

df[grass_or_rock & (df['Speed'] == lowest_speed)]

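| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 230 | 213 | Shuckle | Bug | Rock | 505 | 20 | 10 | 230 | 10 | 230 | 5 | 2 | False |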


In [194]:
# Get the Lowest speed for grass or rock
df.loc[grass_or_rock, 'Speed'].min()

5

# Practice Exercises - August 1st

In [28]:
import pandas as pd

df = pd.read_csv('https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv')

In [30]:
# 10. Change all the column names to lowercase and replace whitespace with underscores

df.columns = df.columns.str.lower().str.replace(' ', '_')

df.head(1)

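| | # | name | type_1 | type_2 | total | hp | attack | defense | sp._atk | sp._def | speed | generation | legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |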
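
The string methods on `df.columns` can be checked on an empty frame that has only column names. This is a sketch: the three names are taken from the Pokemon columns, and the extra step that strips the period is an optional addition, not part of the exercise.

```python
import pandas as pd

# Empty frame carrying three of the original column names.
raw = pd.DataFrame(columns=['Name', 'Type 1', 'Sp. Atk'])

# .str applies each string method to every column name.
raw.columns = raw.columns.str.lower().str.replace(' ', '_')
cleaned = list(raw.columns)

# Optional extra step: drop the period as well.
# regex=False treats '.' as a literal character instead of "any character".
no_dots = list(raw.columns.str.replace('.', '', regex=False))
```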


In [197]:
# 11. Rename Sp. Attack to special-attack

# REASSIGN VARIABLE OR USE inplace=True

df.rename(columns={'sp._atk': 'special-attack'}, inplace=True)

df.head(1)

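| | # | name | type_1 | type_2 | total | hp | attack | defense | special-attack | sp._def | speed | generation | legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |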


In [198]:
# 12. Rename Sp. Def to special-defense

# REASSIGN VARIABLE OR USE inplace=True

df.rename(columns={'sp._def': 'special-defense'}, inplace=True)

df.head(1)

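| | # | name | type_1 | type_2 | total | hp | attack | defense | special-attack | special-defense | speed | generation | legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |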


In [199]:
# 13.1 Which Pokemon has/have the greatest difference in Attack and Defense points?
    
df['atk_def_dif'] = abs(df['attack'] - df['defense'])

df.sort_values('atk_def_dif', ascending=False).head(1)

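| | # | name | type_1 | type_2 | total | hp | attack | defense | special-attack | special-defense | speed | generation | legendary | atk_def_dif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 230 | 213 | Shuckle | Bug | Rock | 505 | 20 | 10 | 230 | 10 | 230 | 5 | 2 | False | 220 |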
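
Because the question asks which Pokemon "has/have" the greatest difference, ties may matter, and `sort_values(...).head(1)` returns only one row. A sketch of a tie-aware version, on a hypothetical frame where two rows share the largest difference:

```python
import pandas as pd

# Hypothetical stats: A and B both differ by 90 points.
sample = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'attack': [10, 90, 50],
    'defense': [100, 0, 50],
})

diff = (sample['attack'] - sample['defense']).abs()

# Comparing against the maximum keeps every row that ties for first place.
top = sample[diff == diff.max()]
```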


In [200]:
# 13.2 Which Pokemon has/have the greatest difference Special Attack and Special Defense?

df['sp_atk_sp_def_dif'] = abs(df['special-attack'] - df['special-defense'])

df.sort_values('sp_atk_sp_def_dif', ascending=False).head(1)

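| | # | name | type_1 | type_2 | total | hp | attack | defense | special-attack | special-defense | speed | generation | legendary | atk_def_dif | sp_atk_sp_def_dif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 230 | 213 | Shuckle | Bug | Rock | 505 | 20 | 10 | 230 | 10 | 230 | 5 | 2 | False | 220 | 220 |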


In [201]:
# 14. How many Pokemon are of Type 1 Rock and Type 2 Fairy?

rock_and_fairy = df[(df['type_1'] == 'Rock') & (df['type_2'] == 'Fairy')]

# len() gives the number of matching rows directly
len(rock_and_fairy)

3

In [202]:
# 15.  Which Fire Pokemon appears last alphabetically?
    # Create a function named last_pokemon
    # Use the loaded dataframe as an argument

def last_pokemon(df):
    last_fire = df.loc[df['type_1'] == 'Fire', 'name'].max()
    return last_fire

last_pokemon(df)

'Vulpix'

In [203]:
# Bonus
# Find the average speed by Generation

df.groupby('generation')['speed'].mean()

generation
1    72.584337
2    61.811321
3    66.925000
4    71.338843
5    68.078788
6    66.439024
Name: speed, dtype: float64

In [204]:
# Create a function which accepts:
# a dataframe
# two different Pokemon stats e.g. Attack and Defense
    # The function will:
    # Calculate difference between the two given Pokemon stats
    # Create a column which contains the difference of the two given Pokemon stats named stats-diff
    
def stat_dif(df, stat1, stat2):
    dif = abs(df[stat1] - df[stat2])
    df[f'stats-diff-{stat1}-&-{stat2}'] = dif
    return df

stat_dif(df, 'attack', 'defense') 

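| | # | name | type_1 | type_2 | total | hp | attack | defense | special-attack | special-defense | speed | generation | legendary | atk_def_dif | sp_atk_sp_def_dif | stats-diff-attack-&-defense |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | 0 | 0 | 0 |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | 1 | 0 | 1 |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | 1 | 0 | 1 |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False | 23 | 2 | 23 |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | 9 | 10 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True | 50 | 50 | 50 |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True | 50 | 50 | 50 |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True | 50 | 20 | 50 |
| 798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True | 100 | 40 | 100 |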
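
The function above adds the new column to the dataframe that was passed in, so the caller's `df` changes as a side effect. A possible variant, sketched here on a hypothetical three-row frame, uses `.assign` to return a new dataframe and leave the input as it was.

```python
import pandas as pd

# Hypothetical stand-in for three rows of the Pokemon data.
sample = pd.DataFrame({
    'name': ['Bulbasaur', 'Ivysaur', 'VenusaurMega Venusaur'],
    'attack': [49, 62, 100],
    'defense': [49, 63, 123],
})

def stat_dif(df, stat1, stat2):
    col = f'stats-diff-{stat1}-&-{stat2}'
    # The column name contains hyphens, so it is passed to assign
    # by unpacking a dictionary rather than as a keyword.
    return df.assign(**{col: (df[stat1] - df[stat2]).abs()})

result = stat_dif(sample, 'attack', 'defense')
```

Whether to modify in place or return a copy is a design choice; the copy costs extra memory but makes repeated calls easier to reason about.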
