# Discussion

## 1. What is a dataframe?

A DataFrame is a two-dimensional labeled data structure in pandas, where each column can have a different data type (e.g., integer, float, string)
and each row represents a record. 
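
As a minimal sketch of that definition, the block below builds a small DataFrame from a dictionary. The student names and grades are made up for illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical student grades, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy'],
    'math': [62, 88, 94, 98],
    'english': [85.5, 79.0, 74.0, 96.5],
    'passed': [False, True, True, True],
})

# Each column is a Series with its own data type.
print(df.dtypes)

# Each row is one record; iloc[0] returns the first one.
print(df.iloc[0])
```

Passing a dictionary of equal-length lists is only one of several ways to construct a DataFrame; reading a CSV, as in the exercises, is another.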

## 2. What information can we obtain about a dataframe?

.info() prints a concise summary: the number of non-null values and the data type of each column, plus the memory usage of the DataFrame

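.describe() can be used to generate summary statistics of a DataFrame or Series. For numeric data it includes the following for each column:

    count - the number of non-null values
    mean
    std - the standard deviation
    min
    25% - the 25th percentile
    50% - the 50th percentile (the median)
    75% - the 75th percentile
    max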

.describe(include='all') summarizes every column, numeric or not; non-numeric columns get count, unique, top, and freq

**These are attributes, not methods, so they do not require ()**

    .dtypes the data type of each column

    .shape the number of rows and columns, as a tuple

    .columns the column names

    .index the labels for each row
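
The methods and attributes in this section can be tried together on a small frame. This is a sketch on made-up data (the names, grades, and the missing english value are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data with one missing english grade.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie'],
    'math': [62, 88, 94],
    'english': [85.0, None, 74.0],
})

df.info()                                # prints the summary and returns None
stats = df.describe()                    # numeric columns only by default
all_stats = df.describe(include='all')   # every column, numeric or not

# Attributes are accessed without parentheses.
n_rows, n_cols = df.shape
column_names = list(df.columns)
row_labels = list(df.index)
```

Here `stats.loc['count', 'english']` is 2 rather than 3, because count ignores the missing value.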

## 3. How have we interacted with a dataframe?

Subsetting to access multiple columns with a list of strings

    .head() top 5 by default
    .tail() bottom 5 by default
    .sample() 1 by default
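
A short sketch of column subsetting and the three row-preview methods, using a made-up grades frame (all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas'],
    'math': [62, 88, 94, 98, 77, 79, 82],
    'english': [85, 79, 74, 96, 92, 76, 64],
})

one_column = df['math']              # a single string returns a Series
two_columns = df[['name', 'math']]   # a list of strings returns a DataFrame

first_five = df.head()               # first 5 rows by default
last_two = df.tail(2)                # pass n to change the number of rows
random_row = df.sample(random_state=42)  # 1 random row; random_state makes it repeatable
```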

**Boolean mask/series**
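
A boolean mask is a Series of True/False values, one per row; indexing a DataFrame with it keeps only the rows where the mask is True. The sketch below uses made-up grades, and the 70 and 80 cutoffs are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy'],
    'math': [62, 88, 94, 98],
    'english': [85, 79, 74, 96],
})

mask = df['math'] >= 70        # boolean Series, one value per row
passing_math = df[mask]        # rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses.
strong_in_both = df[(df['math'] >= 70) & (df['english'] >= 80)]
```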

**Drop columns**

    df.drop(columns=['english', 'reading'])

**Rename columns**

    df.rename(columns={'name': 'student'})

**Creating columns**

    df['passing_math'] = df.math >= 70

**Add another column using .assign**

    df.assign(passing_english=df.english >= 70)
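
One detail worth checking by hand: .drop, .rename, and .assign return a new DataFrame and leave the original untouched, while bracket assignment changes the original in place. A sketch on made-up data (values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie'],
    'math': [62, 88, 70],
    'english': [85, 79, 74],
    'reading': [80, 67, 95],
})

# These three return new DataFrames; df itself is unchanged.
dropped = df.drop(columns=['english', 'reading'])
renamed = df.rename(columns={'name': 'student'})
assigned = df.assign(passing_english=df['english'] >= 70)

# Bracket assignment adds the column to df itself.
df['passing_math'] = df['math'] >= 70
```

To keep the result of .drop, .rename, or .assign, assign it to a name, for example `df = df.rename(columns={'name': 'student'})`.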

**Sorting a DataFrame by columns**

    df.sort_values(by='english', ascending=False)  # ascending is True by default

**Method chaining**

    df[df.english > 90].sort_values(by='english').head(1).name
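
The chain above can be read one step at a time. The sketch below runs the same sequence on made-up data (names and grades are assumptions) and wraps the chain in parentheses so each step sits on its own line.

```python
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy'],
    'english': [85, 93, 74, 96],
})

lowest_above_90 = (
    df[df['english'] > 90]        # keep rows with english above 90
    .sort_values(by='english')    # ascending, so the lowest of those comes first
    .head(1)                      # take the first row
    .loc[:, 'name']               # keep only the name column
)
```

Each method returns a new DataFrame or Series, which is what allows the next method to be called on the result.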

**loc and iloc**

*loc*
Select all the rows and a subset of columns by label. Label slices are inclusive: both 'math' and 'reading' are returned.

    df.loc[:, 'math':'reading']

*iloc*
Select by integer position instead of label. Position slices exclude the stop value: this returns the rows at positions 0, 1, and 2.

    df.iloc[:3]
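
The inclusive and exclusive slicing rules are easiest to see side by side. This sketch uses a made-up frame with the default integer index; the data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy'],
    'math': [62, 88, 94, 98],
    'english': [85, 79, 74, 96],
    'reading': [80, 67, 95, 88],
})

by_label = df.loc[:, 'math':'reading']   # label slice: includes 'reading'
by_position = df.iloc[:, 1:3]            # position slice: columns 1 and 2, not 3

rows_by_label = df.loc[:2]               # labels 0, 1, 2 (inclusive)
rows_by_position = df.iloc[:2]           # positions 0 and 1 (exclusive)
```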

**Aggregating for summary stats**

    df[['english', 'reading', 'math']].agg(['mean', 'min', 'max'])
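
As a sketch of what .agg returns when given a list of function names, the block below runs it on made-up grades (values are assumptions for illustration). The result has one row per statistic and one column per subject.

```python
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'english': [85, 79, 74],
    'reading': [80, 67, 95],
    'math': [62, 88, 94],
})

summary = df[['english', 'reading', 'math']].agg(['mean', 'min', 'max'])

highest_math = summary.loc['max', 'math']        # look up one statistic
lowest_reading = summary.loc['min', 'reading']
```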

**Group By for another way to summarize**

The highest math grade from each classroom:

    df.groupby('classroom').math.max()
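
A minimal sketch of the same group-by on made-up data; the classroom labels 'A' and 'B' and the grades are assumptions for illustration.

```python
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy'],
    'classroom': ['A', 'B', 'A', 'B'],
    'math': [62, 88, 94, 98],
})

# One value per classroom: the result is a Series indexed by classroom.
highest_math = df.groupby('classroom')['math'].max()

# Several statistics at once: the result is a DataFrame.
math_range = df.groupby('classroom')['math'].agg(['min', 'max'])
```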

**np.where**
Create the new column based on an existing column.

    df['passing_math'] = np.where(df.math < 70, 'failing', 'passing')
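
np.where takes a condition, the value to use where it is True, and the value to use where it is False. A sketch on made-up grades; the column name `math_result` is hypothetical, chosen so the labels do not overwrite an existing column.

```python
import numpy as np
import pandas as pd

# Hypothetical grades data, invented for this example.
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie'],
    'math': [62, 88, 70],
})

# Where math < 70 use 'failing', otherwise 'passing'.
df['math_result'] = np.where(df['math'] < 70, 'failing', 'passing')
```

A grade of exactly 70 is labeled 'passing' here, because the condition tests for strictly less than 70.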

**Concat**

    pd.concat([df1, df2], axis=0)  # axis=0 is the default; it stacks df2 under df1
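
A sketch of how the axis argument changes the result, using two small made-up frames (the data is an assumption for illustration). Note that stacking keeps each frame's original row labels unless ignore_index=True is passed.

```python
import pandas as pd

# Two hypothetical frames with the same columns.
df1 = pd.DataFrame({'name': ['Sally', 'Jane'], 'math': [62, 88]})
df2 = pd.DataFrame({'name': ['Suzie', 'Billy'], 'math': [94, 98]})

stacked = pd.concat([df1, df2])                         # axis=0: df2 goes under df1
renumbered = pd.concat([df1, df2], ignore_index=True)   # fresh 0..n-1 row labels
side_by_side = pd.concat([df1, df2], axis=1)            # axis=1: df2 goes to the right
```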

**Merge**

df.merge combines two DataFrames on shared keys, much like a JOIN in SQL. The signature below shows the default values of the commonly used parameters.

how: the type of merge to be performed.

how=left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

how=right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

how=outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

how=inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

    left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, indicator=False)
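
The four `how` options can be compared on two tiny frames. This is a sketch under assumed data: the column `classroom_id`, the student names, and the teacher names are all hypothetical.

```python
import pandas as pd

# Hypothetical frames: classroom 3 has no teacher row, classroom 4 has no student.
left_df = pd.DataFrame({
    'student': ['Sally', 'Jane', 'Billy'],
    'classroom_id': [1, 2, 3],
})
right_df = pd.DataFrame({
    'classroom_id': [1, 2, 4],
    'teacher': ['Smith', 'Jones', 'Lee'],
})

inner = left_df.merge(right_df, how='inner', on='classroom_id')   # keys in both
left = left_df.merge(right_df, how='left', on='classroom_id')     # all left keys
right = left_df.merge(right_df, how='right', on='classroom_id')   # all right keys

# indicator=True adds a _merge column saying where each row came from.
outer = left_df.merge(right_df, how='outer', on='classroom_id', indicator=True)
```

Rows with no match on the other side are kept by the left, right, and outer merges and filled with NaN in the columns that come from the missing side.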


---
# Practice Exercises

In [2]:
import pandas as pd

df = pd.read_csv('https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv')

#### Information about a dataframe

1. Obtain the following information:

In [3]:
# dimensions
df.shape

(800, 13)

In [4]:
# dtypes
df.dtypes

#              int64
Name          object
Type 1        object
Type 2        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

In [5]:
# column names
df.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

In [10]:
# summary statistics
df.describe()

                #      Total         HP     Attack    Defense    Sp. Atk    Sp. Def      Speed  Generation
count       800.0      800.0      800.0      800.0      800.0      800.0      800.0      800.0       800.0
mean    362.81375   435.1025   69.25875   79.00125    73.8425      72.82    71.9025    68.2775     3.32375
std    208.343798  119.96304  25.534669  32.457366  31.183501  32.722294  27.828916  29.060474     1.66129
min           1.0      180.0        1.0        5.0        5.0       10.0       20.0        5.0         1.0
25%        184.75      330.0       50.0       55.0       50.0      49.75       50.0       45.0         2.0
50%         364.5      450.0       65.0       75.0       70.0       65.0       70.0       65.0         3.0
75%        539.25      515.0       80.0      100.0       90.0       95.0       90.0       90.0         5.0
max         721.0      780.0      255.0      190.0      230.0      194.0      230.0      180.0         6.0


#### Working with dataframes

In [14]:
# 1. What is the highest Attack value present?
    # Create a function named highest_attack.
    # Use the loaded dataframe as an argument.
    
def highest_attack(df):
    return df['Attack'].max()

In [15]:
highest_attack(df)

190

In [25]:
# 2. Which Pokemon possess(es) the highest HP value?

# df[df.HP == df.HP.max()] using dot notation

df[df['HP'] == df['HP'].max()] # using bracket notation

       #     Name  Type 1  Type 2  Total   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
261  242  Blissey  Normal     NaN    540  255      10       10       75      135     55           2      False


In [27]:
# 3. How many different types are represented in Type 1?

# Create a function named num_types
# Use the loaded dataframe as an argument

def num_types(df):
    return df['Type 1'].nunique()

In [28]:
num_types(df)

18

In [34]:
# 4. Number of Pokemon whose Type 2 is Ghost

df[df['Type 2'] == 'Ghost'].shape[0]

14

In [40]:
# 5. Percentage of Pokemon whose Type 2 is Ghost

    # Create a function named percent_ghost
    # Use the loaded dataframe as an argument
    
def percent_ghost(df):
    num_ghost = df[df['Type 2'] == 'Ghost'].shape[0]
    num_total = df.shape[0]
    percent = num_ghost / num_total * 100
    return '{:.2f}%'.format(percent)


In [41]:
percent_ghost(df)

'1.75%'

In [46]:
# 6. Number of Pokemon whose Attack is greater than Defense

df[df['Attack'] > df['Defense']].shape[0]

433

In [77]:
# 7. Lowest speed for Grass or Rock

grass_or_rock = df['Type 1'].isin(['Grass', 'Rock']) | df['Type 2'].isin(['Grass', 'Rock'])

# Take the minimum within the Grass/Rock subset, not across every Pokemon.
lowest_speed = df.loc[grass_or_rock, 'Speed'].min()

df[grass_or_rock & (df['Speed'] == lowest_speed)]

       #     Name  Type 1  Type 2  Total   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
230  213  Shuckle     Bug    Rock    505   20      10      230       10      230      5           2      False


In [76]:
# Get the Lowest speed for grass or rock
df.loc[grass_or_rock, 'Speed'].min()  # select rows and column in a single .loc call

5