![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Data Transformation Cheat Sheet

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Table of Contents

1. Creating a Pandas Dataframe
    - Setting an Index
    - Sorting Dataframe
2. Modifying Datetime
    - Converting Year and Month
3. Summary Statistics
4. Column Wrangling & Feature Engineering
    - Modifying All
    - Modify the first appearance of Vision to year 1964
5. Pivoting Data
    - Wide
    - Long
6. Aggregating Data
    - Calculating Percentage Share
    - Calculating Unique Values
    - Grouping by Dates and Frequency (Week)
    - Named Aggregation
7. Selecting, Indexing, Filtering & Slicing
8. Filtering
    - Filtering Multiple Values
    - Retrieving Single and Multiple Columns
    - Filtering then summarizing
9. Modifying Dataframes
    - Renaming Columns
10. Appending Values
11. Cleaning Null Values
    - Filling Null Values on Dataframes
    - Backward and Forward filling
12. Cleaning All Values
13. Dealing w/ Duplicates
    - Duplicates in DataFrames
14. Changing Datatypes
15. Joining Data
16. Creating a Progress Bar


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Libraries

In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.read_csv?

In [None]:
# Please see the Data Ingestion Cheat sheet to learn how to load data
sales = pd.read_csv(
    'data/sales_data.csv',
    parse_dates=['Date']
)

In [None]:
census_county = pd.read_csv("data/census_county.csv")

new = census_county["NAME"].str.split(", ", expand = True)
census_county["county_name"] = new[0]
census_county["state_name"] = new[1]

census_county.drop(columns=['NAME'], inplace=True)
census_county.head()

The `read_csv` function is extremely powerful, and you can specify many more parameters at import time. Often a single call with the right parameters achieves what would otherwise take several steps.
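
As a hedged illustration of setting several options in one call, the sketch below parses a small in-memory CSV. The two rows in `csv_text` are made-up stand-ins for a real file, and `prices` is a name chosen only for this example.

```python
import io
import pandas as pd

# Hypothetical two-column CSV with no header row, standing in for a real file.
csv_text = "2017-04-02 00:00:00,1099.17\n2017-04-03 00:00:00,1141.81\n"

prices = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                   # the data has no header row
    names=['Timestamp', 'Price'],  # so supply the column names
    index_col=0,                   # use the first column as the index
    parse_dates=True,              # and parse that index as datetimes
)
print(prices)
print(prices.index.dtype)
```

Doing the parsing at import time avoids separate `to_datetime` and `set_index` steps afterwards.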

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
# Creating a Pandas Dataframe


In [None]:
country_stat = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Setting the Index

In [None]:
country_stat.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [None]:
country_stat.sort_values(by=['Continent','Population','HDI'], ascending=False)

In [None]:
btc_price = pd.read_csv(
    'data/btc-market-price.csv',
    header=None,
    names=['Timestamp', 'Price'],
    index_col=0,
    parse_dates=True
)

In [None]:
btc_price.head()

In [None]:
marvel_data = [
    ['Spider-Man', 'male', 1962],
    ['Captain America', 'male', 1941],
    ['Wolverine', 'male', 1974],
    ['Iron Man', 'male', 1963],
    ['Thor', 'male', 1963],
    ['Thing', 'male', 1961],
    ['Mister Fantastic', 'male', 1961],
    ['Hulk', 'male', 1962],
    ['Beast', 'male', 1963],
    ['Invisible Woman', 'female', 1961],
    ['Storm', 'female', 1975],
    ['Namor', 'male', 1939],
    ['Hawkeye', 'male', 1964],
    ['Daredevil', 'male', 1964],
    ['Doctor Strange', 'male', 1963],
    ['Hank Pym', 'male', 1962],
    ['Scarlet Witch', 'female', 1964],
    ['Wasp', 'female', 1963],
    ['Black Widow', 'female', 1964],
    ['Vision', 'male', 1968]
]

marvel_df = pd.DataFrame(data=marvel_data,
                         columns=['name', 'sex', 'first_appearance'])
marvel_df

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Modifying Datetime

In [None]:
btc_price.dtypes

In [None]:
btc_price = btc_price.reset_index()
btc_price['Timestamp'] = pd.to_datetime(btc_price['Timestamp'])
btc_price.dtypes

In [None]:
btc_price.set_index('Timestamp', inplace=True)

In [None]:
btc_price.loc['2017-09-29':'2017-10-05']
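
A minimal sketch of how string slicing behaves on a `DatetimeIndex`, using a made-up daily series (`daily`) rather than the Bitcoin data. Both ends of a label slice are included, and a partial string such as a year and month selects every row in that period.

```python
import pandas as pd

# Hypothetical daily series: 2017-09-28 through 2017-10-07.
idx = pd.date_range('2017-09-28', periods=10, freq='D')
daily = pd.Series(range(10), index=idx)

# Label slices on a DatetimeIndex include both endpoints.
window = daily.loc['2017-09-29':'2017-10-05']

# A partial date string selects the whole month.
october = daily.loc['2017-10']

print(len(window), len(october))
```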

In [None]:
import datetime

year = datetime.date.today().year

marvel_df['years_since'] = year - marvel_df['first_appearance']
marvel_df

## Dealing with Dates

In [None]:
sales['Year'] = sales['Date'].dt.year
sales['Month'] = sales['Date'].dt.month
sales

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Summary Statistics

In [None]:
sales.head()

In [None]:
sales.shape

In [None]:
sales.info()

In [None]:
sales.columns

In [None]:
sales.describe()

In [None]:
sales[['Age_Group','Unit_Price']].groupby('Age_Group').describe()

In [None]:
sales.size

In [None]:
sales.dtypes

In [None]:
sales.dtypes.value_counts()

In [None]:
sales['Unit_Cost'].describe()

In [None]:
sales['Unit_Cost'].mean()

In [None]:
sales['Unit_Cost'].median()

In [None]:
sales['Unit_Cost'].min(), sales['Unit_Cost'].max()

In [None]:
sales['Unit_Cost'].std()

In [None]:
sales['Unit_Cost'].quantile(0.25)

In [None]:
sales['Unit_Cost'].quantile([.2, .4, .6, .8, 1])

In [None]:
sales['Age_Group'].value_counts()

In [None]:
sales['Age_Group'].value_counts(normalize=True)

In [None]:
country_group = sales.groupby(['Country'])
country_group['Age_Group'].value_counts(normalize=True).loc['Australia']

In [None]:
pd.DataFrame(sales['Age_Group'].value_counts(normalize = True)).plot(kind = 'bar', figsize = (10,5))

In [None]:
sales

In [None]:
# numeric_only=True skips text and date columns (required in pandas 2.0+)
corr = sales.corr(numeric_only=True)
corr

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Column Wrangling & Feature Engineering

In [None]:
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']
sales['Revenue_per_Age'].head()

In [None]:
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']
sales['Calculated_Cost'].head()

In [None]:
(sales['Calculated_Cost'] != sales['Cost']).sum()

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Modifying All

In [None]:
sales['Calculated_Cost'] *= 1.03
sales['Calculated_Cost'].head()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

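## Modify the `first_appearance` of `Vision` to year 1964

In [None]:
# marvel_df uses the default integer index, so select the row by name with a boolean mask
marvel_df.loc[marvel_df['name'] == 'Vision', 'first_appearance'] = 1964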
marvel_df

In [None]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()

In [None]:
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.10
sales.loc[sales['Country'] == 'France', 'Revenue'].head()

In [None]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
crisis

In [None]:
country_stat[['GDP', 'HDI']]

In [None]:
# The Series is aligned on column names and added to every row
country_stat[['GDP', 'HDI']] + crisis

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
# Pivoting Data

## Long

In [None]:
d1 = {"Name": ["Pankaj", "Lisa", "David"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]}
df_wide = pd.DataFrame(d1)
df_wide

In [None]:
df_long = pd.melt(df_wide, id_vars=["ID"], value_vars=["Name", "Role"], var_name="Attribute", value_name="Value")
df_long

## Wide

In [None]:
# https://beta.bls.gov/dataQuery/find?fq=survey:[ap]&s=popularity:D
# This data came from bls
df_long = pd.read_csv("data/file.csv")
df_long

In [None]:
# unmelting using pivot()
df_wide=pd.pivot(df_long, index=['Series ID','Item'], columns = 'Year Month',values = 'Avg. Price ($)') #Reshape from long to wide

df_wide

## Long

In [None]:
year_list=list(df_wide.columns)
df_long = pd.melt(df_wide, value_vars=year_list,value_name='Avg. Price ($)', ignore_index=False).reset_index()
df_long
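
The round trip between long and wide formats can be sketched on a tiny made-up table. The column names `Item`, `Month`, and `Price` below are assumptions for illustration, not the BLS file's schema.

```python
import pandas as pd

# Hypothetical long-format prices: one row per (Item, Month) pair.
long_df = pd.DataFrame({
    'Item': ['Eggs', 'Eggs', 'Milk', 'Milk'],
    'Month': ['2023-01', '2023-02', '2023-01', '2023-02'],
    'Price': [4.8, 4.2, 4.2, 4.1],
})

# Long -> wide: one row per Item, one column per Month.
wide_df = long_df.pivot(index='Item', columns='Month', values='Price')

# Wide -> long: melt the month columns back into rows.
round_trip = (
    wide_df.reset_index()
    .melt(id_vars='Item', var_name='Month', value_name='Price')
    .sort_values(['Item', 'Month'])
    .reset_index(drop=True)
)
print(wide_df)
print(round_trip)
```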

# Aggregating Data

## Calculating Percentage Share

In [None]:
sales_yr_age = sales.groupby(['Year','Age_Group']).agg({'Profit':'sum','Revenue':'sum'})
sales_yr_age['Revenue_Share'] = sales_yr_age['Revenue'] / sales_yr_age.groupby(['Year'])['Revenue'].sum()
sales_yr_age['Profit_Share'] = sales_yr_age['Profit'] / sales_yr_age.groupby(['Year'])['Profit'].sum()
sales_yr_age.reset_index()
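
An alternative sketch of the share calculation uses `transform('sum')`, which returns the group total repeated on every row so the division lines up row by row. The `toy` frame below is made-up data that reuses the sales column names.

```python
import pandas as pd

# Hypothetical revenue figures by year and age group.
toy = pd.DataFrame({
    'Year': [2013, 2013, 2014, 2014],
    'Age_Group': ['Youth', 'Adults', 'Youth', 'Adults'],
    'Revenue': [100, 300, 50, 150],
})

by_group = toy.groupby(['Year', 'Age_Group'], as_index=False)['Revenue'].sum()

# transform('sum') broadcasts each year's total back to that year's rows.
year_total = by_group.groupby('Year')['Revenue'].transform('sum')
by_group['Revenue_Share'] = by_group['Revenue'] / year_total
print(by_group)
```

Because `transform` keeps the original row order and length, this form does not rely on index alignment between the two sides of the division.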

In [None]:
df_freq = pd.DataFrame({
    "Publish Date" : [
        pd.Timestamp("2000-01-01"),
        pd.Timestamp("2000-01-02"),
        pd.Timestamp("2000-01-02"),
        pd.Timestamp("2000-01-02"),
        pd.Timestamp("2000-01-09"),
        pd.Timestamp("2000-01-16")
    ],
    "ID": [0, 1, 2, 3, 4, 5],
    "Price": [10, 20, 30, 40, 50, 60]
    }
)
df_freq

In [None]:
df_freq.groupby('Publish Date')['Price'].mean()

In [None]:
df_freq.groupby(pd.Grouper(key = "Publish Date", freq= "1D"))['Price'].mean()
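
To group by week instead of by day, the same `pd.Grouper` pattern can be used with a weekly frequency. This is a sketch that rebuilds a reduced copy of the data above under the assumed name `posts`; `'W'` means weeks ending on Sunday.

```python
import pandas as pd

posts = pd.DataFrame({
    'Publish Date': pd.to_datetime([
        '2000-01-01', '2000-01-02', '2000-01-02',
        '2000-01-02', '2000-01-09', '2000-01-16',
    ]),
    'Price': [10, 20, 30, 40, 50, 60],
})

# Each bin is labeled with the Sunday that closes the week.
weekly = posts.groupby(pd.Grouper(key='Publish Date', freq='W'))['Price'].mean()
print(weekly)
```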

## Counting Unique Values

In [None]:
sales.groupby('Age_Group').agg({'State':'nunique'})

## Aggregating on Multiple Values

In [None]:
# String aggregation names avoid the FutureWarning newer pandas raises for np.sum
sales_country = sales.groupby('Country').agg({'Order_Quantity':'sum','Revenue':'sum'}).reset_index()
sales_country

In [None]:
sales_country['aov'] = sales_country['Revenue']/ sales_country['Order_Quantity']
sales_country

In [None]:
# USE as_index = False INSTEAD OF reset_index()
sales.groupby('Country', as_index=False).agg(
    Order_Total = ("Order_Quantity", "sum"),
    Orders_Avg = ("Order_Quantity", "mean"),
    Revenue_Total = ("Revenue", "sum"),
    Revenue_avg = ("Revenue", "mean"),
    Unique_States = ("State", "nunique")
)

In [None]:
sales.head()

## Weighted Average Calculation

In [None]:
def wavg(group, avg_name, weight_name):
    """Weighted average of `avg_name`, weighted by `weight_name`.

    Adapted from http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    In rare instances the weights may sum to zero, so just return the mean. Customize this if your
    business case should return otherwise.
    """
    d = group[avg_name]
    w = group[weight_name]
    # pandas division by zero yields NaN/inf rather than raising ZeroDivisionError,
    # so check the weight total explicitly.
    if w.sum() == 0:
        return d.mean()
    return (d * w).sum() / w.sum()

In [None]:
# Average median_income, weighted by population (value column first, weight column second)
census_wavg = census_county.groupby(['state_name']).apply(wavg, 'median_income', 'population').reset_index()
census_wavg = census_wavg.rename(columns={0: 'median_income'})
census_wavg
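
As a cross-check on the idea, a population-weighted mean can also be sketched with `np.average`. The `counties` frame and the helper `weighted_mean` below are hypothetical and use made-up numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical county rows: state A has two counties, state B has one.
counties = pd.DataFrame({
    'state_name': ['A', 'A', 'B'],
    'population': [100, 300, 50],
    'median_income': [40_000, 60_000, 30_000],
})

def weighted_mean(group, value_col, weight_col):
    """Mean of value_col where each row counts in proportion to weight_col."""
    return np.average(group[value_col], weights=group[weight_col])

state_income = {
    state: weighted_mean(group, 'median_income', 'population')
    for state, group in counties.groupby('state_name')
}
print(state_income)
```

For state A the result is (100 × 40,000 + 300 × 60,000) / 400 = 55,000, which is pulled toward the larger county.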

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Selecting, Indexing, Filtering & Slicing

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Filtering

#### `loc` is useful for filtering rows by label or condition, while `iloc` is good for selecting rows by position

#### Look for individual rows using the `loc` indexer

In [None]:
sales = sales.set_index('Country')

In [None]:
sales.loc['Canada']

#### Select the last row by sequential position

In [None]:
sales.iloc[-1]
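
The difference between label-based and position-based selection can be sketched on a small frame. `stats` below is a reduced, illustrative copy of the country table.

```python
import pandas as pd

stats = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94]},
    index=['Canada', 'France', 'Germany'],
)

by_label = stats.loc['France', 'Population']  # row and column by label
by_position = stats.iloc[-1, 0]               # last row, first column by position

label_slice = stats.loc['Canada':'France']    # label slices include the end label
position_slice = stats.iloc[0:1]              # position slices exclude the stop

print(by_label, by_position, len(label_slice), len(position_slice))
```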

In [None]:
sales = sales.reset_index()
sales['Country']

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Filtering Multiple Values

In [None]:
geo_list = ['Canada','Australia','United States']
geo_filter = sales['Country'].isin(geo_list)
sales[geo_filter]

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## Retrieving Single and Multiple columns

In [None]:
sales.loc[(sales['Month'] == 11) &
          (sales['Year'] == 2013),
          ['Age_Group','Revenue']
]

In [None]:
sales.query("State == 'Kentucky' and Year == 2014")[['Year','Age_Group','Revenue']]

In [None]:
sales.loc[sales['State'] == 'Kentucky', ['Age_Group','Revenue']]

In [None]:
country_stat.loc['France': 'Italy']

In [None]:
country_stat.loc['France': 'Italy', 'Population']

In [None]:
country_stat.loc['France': 'Italy', ['Population', 'GDP']]

In [None]:
sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].head()

### Get the mean revenue of the `Adults (35-64)` sales group

In [None]:
sales.loc[sales['Age_Group']=='Adults (35-64)', 'Revenue'].mean()

### How Many Records belong to Age Group `Youth (<25)` or `Adults (35-64)`?

In [None]:
sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]

### Get the mean revenue of the sales group `Adults (35-64)` in `United States`

In [None]:
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean()

### Check the revenue of every sale made in France (already increased by 10% earlier)

In [None]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Dropping Data

In [None]:
country_stat.drop('Canada')

In [None]:
country_stat.drop(['Canada', 'Japan'])

In [None]:
country_stat.drop(columns=['Population', 'HDI'])

#### Drop rows using axis = 0 or rows

In [None]:
country_stat.drop(['Canada', 'Germany'], axis=0)

In [None]:
country_stat.drop(['Canada', 'Germany'], axis='rows')

#### Drop Columns using axis = 1 or columns

In [None]:
country_stat.drop(['Population', 'HDI'], axis=1)

In [None]:
country_stat.drop(['Population', 'HDI'], axis='columns')

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Modifying Dataframes

In [None]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)

In [None]:
langs

In [None]:
country_stat['Language'] = langs
country_stat

In [None]:
country_stat['Language'] = 'English'
country_stat

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Renaming Columns

In [None]:
country_stat.rename(
    columns={
        'HDI': 'Human Development Index',
        'Anual Popcorn Consumption': 'APC'
    }, index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    })

In [None]:
country_stat.rename(index=str.upper)

In [None]:
country_stat.rename(index=str.lower)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Appending Values

In [None]:
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
pd.concat([
    country_stat,
    pd.Series({
        'Population': 3,
        'GDP': 5
    }, name='China').to_frame().T
])

In [None]:
country_stat.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})
country_stat

In [None]:
country_stat.reset_index()

In [None]:
country_stat.set_index('Population')

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Cleaning Data

In [None]:
pd.isnull(np.nan)

In [None]:
pd.isnull(None)

In [None]:
pd.isna(np.nan)

In [None]:
pd.isna(None)

In [None]:
pd.notnull(None)

In [None]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))

In [None]:
pd.Series([1, 2, np.nan]).count()

In [None]:
pd.Series([1, 2, np.nan]).sum()

In [None]:
pd.Series([2, 2, np.nan]).mean()

In [None]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [None]:
pd.notnull(s)

In [None]:
pd.isnull(s)

In [None]:
s.isnull()

In [None]:
s.notnull()

In [None]:
s[s.notnull()]

In [None]:
s

In [None]:
s.dropna()

In [None]:
df_nulls = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

df_nulls

In [None]:
df_nulls.info()

In [None]:
df_nulls.isnull().sum()

In [None]:
df_nulls.dropna()

In [None]:
df_nulls.dropna(axis=1)

In this case, any row or column that contains **at least** one null value will be dropped, which can be too extreme depending on the case. You can control this behavior with the `how` parameter, which can be either `'any'` or `'all'`:

In [None]:
df_nulls2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})
df_nulls2

In [None]:
df_nulls2.dropna(how='all') # if all columns have NA then drop

In [None]:
df_nulls2.dropna(how = 'any') # default behavior

In [None]:
df_nulls2.dropna(thresh=3) # keep only rows with at least 3 non-NA values

In [None]:
df_nulls2.dropna(thresh=3, axis='columns')
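
Since `thresh` counts non-null values rather than nulls, a short sketch may help. The frame `df` below repeats the values of `df_nulls2` under shorter, assumed column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 30],
    'B': [2, np.nan, 31],
    'C': [np.nan, np.nan, 100],
})

# Number of non-null values in each row.
non_null_per_row = df.notnull().sum(axis=1)

kept_2 = df.dropna(thresh=2)  # keeps rows with at least 2 non-null values
kept_3 = df.dropna(thresh=3)  # keeps rows with at least 3 non-null values

print(non_null_per_row.tolist())
print(kept_2.index.tolist(), kept_3.index.tolist())
```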

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Filling Null values on Dataframes

Filling with preset numbers or the statistical measure

In [None]:
df_nulls2.fillna(value = 0)

In [None]:
df_nulls2.fillna(value = df_nulls2['Column A'].mean())

In [None]:
df_nulls2.fillna({'Column A': 0, 'Column B': 99, 'Column C': df_nulls2['Column C'].mean()})

Forward filling or backward filling

In [None]:
# fillna(method=...) is deprecated in recent pandas; ffill() and bfill() are the direct equivalents
df_nulls2.ffill(axis=0)

In [None]:
df_nulls2.bfill(axis=0)
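
A small sketch of how forward and backward filling propagate values, using a made-up Series named `gaps`. The optional `limit` argument caps how many consecutive nulls get filled.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = gaps.ffill()         # carry the last valid value forward
backward = gaps.bfill()        # pull the next valid value backward
limited = gaps.ffill(limit=1)  # fill at most one consecutive null

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

Note that a trailing null has nothing to pull from when backward filling, so it stays null.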

In [None]:
df_nulls2.dropna().count()

In [None]:
rents = pd.read_csv('data/zillow_data.csv')
rents.describe()

In [None]:
rents.info()

In [None]:
import plotly.express as px

In [None]:
px.line(rents, x = 'rent_date', y = 'avg_rents', color='RegionName')

### The line chart above shows gaps where data is missing. Let's fill them with `interpolate`

In [None]:
# Interpolate only the numeric rent column; text columns cannot be interpolated
ny_mask = rents['RegionName'] == 'New York County'
rents.loc[ny_mask, 'avg_rents'] = rents.loc[ny_mask, 'avg_rents'].interpolate(method='linear')

In [None]:
px.line(rents, x = 'rent_date', y = 'avg_rents', color='RegionName')
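
To interpolate every region at once without mixing values across regions, one option is a grouped `transform`. This is a sketch on made-up data: `rents_toy` and its values are assumptions, and only the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical rents with one gap per region.
rents_toy = pd.DataFrame({
    'RegionName': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'avg_rents': [1000.0, np.nan, 1200.0, 500.0, np.nan, 700.0],
})

# Linear interpolation runs separately inside each region.
rents_toy['avg_rents_filled'] = (
    rents_toy.groupby('RegionName')['avg_rents']
    .transform(lambda col: col.interpolate(method='linear'))
)
print(rents_toy)
```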

In [None]:
# True when dropping nulls changes the length, i.e. s has missing values
missing_values = len(s.dropna()) != len(s)
missing_values

In [None]:
len(s)

In [None]:
s.count()

**More Pythonic solution: `any`**

The methods `any` and `all` check whether there is `any` `True` value in a Series or whether `all` the values are `True`. They work the same way as Python's built-in `any` and `all`:

In [None]:
pd.Series([True, False, False]).any()

In [None]:
pd.Series([True, False, False]).all()

In [None]:
pd.Series([True, True, True]).all()

In [None]:
s.isnull().any()
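
Putting the pieces together, a minimal sketch of the common missing-value checks on the same values as `s` (rebuilt here under the assumed name `values` so the block is self-contained):

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, np.nan, np.nan, 4])

has_missing = values.isnull().any()  # is at least one value null?
all_missing = values.isnull().all()  # is every value null?
n_missing = values.isnull().sum()    # how many values are null?

print(has_missing, all_missing, n_missing)
```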

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Cleaning All Values

The following `DataFrame` doesn't have any "missing value", but it clearly has invalid data. `290` doesn't seem like a valid age, and `D` and `?` don't correspond to any known sex category. How can you clean these not-missing, but clearly invalid, values?

In [None]:
gender = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})
gender

In [None]:
gender['Sex'].replace('D','F')

`replace` can also accept a dictionary of values to replace. For example, suppose you were also told that there might be a few `'N'`s that should actually be `'M'`s:

In [None]:
gender['Sex'].replace({'D': 'F', 'N': 'M'})

If you have many columns to replace, you could apply it at “DataFrame level”:

In [None]:
gender.replace({
    'Sex': {
        'D': 'F',
        'N': 'M'
    },
    'Age': {
        290: 29
    }
})
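
When the invalid values are not known one by one, a boolean mask can flag them by rule instead. The sketch below assumes, purely for illustration, that ages over 100 were typed with an extra zero and that only `'M'` and `'F'` are valid categories; both rules are assumptions, not facts about the data.

```python
import pandas as pd

people = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})

# Flag rows that break the assumed rules.
too_old = people['Age'] > 100
invalid_sex = ~people['Sex'].isin(['M', 'F'])

# Assumed fix: treat ages over 100 as having one extra trailing digit.
fixed = people.copy()
fixed.loc[too_old, 'Age'] = fixed.loc[too_old, 'Age'] // 10

print(fixed)
print(invalid_sex.tolist())
```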

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Dealing with Duplicates

## The sales dataset contains information about daily store purchases. What is the correct way to return unique entries for each product?

In [None]:
sales_datacamp = pd.DataFrame(data={'date':['2018-01-15','2018-01-15','2018-01-16','2018-01-16','2018-01-17'],
                           'product_line':['Health and beauty', 'Electronic accessories', 'Home and lifestyle', 'Sports', 'Food and beverages'],
                           'product': ['Shampoo', 'Headphones', 'Lamp', 'Yoga mat', 'Milk'],
                           'unit_price': [6.99,25.38,46.33,39.99,5.99],
                           'quantity': [7,5,3,5,8]
                           })
sales_datacamp

In [None]:
sales_datacamp.drop_duplicates(subset='product')

In [None]:
ambassadors = pd.Series([
    'France',
    'United Kingdom',
    'United Kingdom',
    'Italy',
    'Germany',
    'Germany',
    'Germany',
], index=[
    'Gérard Araud',
    'Kim Darroch',
    'Peter Westmacott',
    'Armando Varricchio',
    'Peter Wittig',
    'Peter Ammon',
    'Klaus Scharioth'
])

In [None]:
ambassadors

In [None]:
ambassadors.duplicated()

In this case `duplicated` didn't mark `'Kim Darroch'` (the first instance of the United Kingdom) or `'Peter Wittig'` (the first instance of Germany) as duplicates. That's because, by default, it considers the first occurrence of a value as not a duplicate. You can change this behavior with the `keep` parameter:

In [None]:
ambassadors.duplicated(keep='last')

In this case, the result is "flipped", `'Kim Darroch'` and `'Peter Wittig'` (the first ambassadors of their countries) are considered duplicates, but `'Peter Westmacott'` and `'Klaus Scharioth'` are not duplicates. You can also choose to mark all of them as duplicates with `keep=False`:

In [None]:
ambassadors.duplicated(keep=False)

A similar method is `drop_duplicates`, which just excludes the duplicated values and also accepts the `keep` parameter:

In [None]:
ambassadors.drop_duplicates(keep='last')

In [None]:
ambassadors.drop_duplicates(keep=False)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Duplicates in DataFrames

Conceptually speaking, duplicates in a DataFrame happen at "row" level. Two rows with exactly the same values are considered to be duplicates:

In [None]:
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'LeBron James',
        'Kobe Bryant',
        'Carmelo Anthony',
        'Kobe Bryant',
    ],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF'
    ]
})

In [None]:
players

In [None]:
players.duplicated()

In [None]:
players.duplicated(subset=['Name'])

In [None]:
players.duplicated(subset=['Name'], keep='last')

In [None]:
players.drop_duplicates()

In [None]:
players.drop_duplicates(subset=['Name'])

In [None]:
players.drop_duplicates(subset=['Name'], keep='last')

In [None]:
sales['Year_Char'] = sales['Year'].astype(str)
sales.dtypes

## Changing Datatype
#### Encode `Product` as a `category` type. The columns of the data frame `sales` now have the following types:

In [None]:
sales['Product'] = sales.Product.astype('category')
sales.dtypes

In [None]:
sales.info()

#### Int Downcasting Value Range

- int8 can store integers from -128 to 127
- int16 can store integers from -32768 to 32767
- int32 can store integers from -2147483648 to 2147483647
- int64 can store integers from -9223372036854775808 to 9223372036854775807
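
These ranges can be read from NumPy directly, and `pd.to_numeric` can pick the smallest integer type that fits a column. A minimal sketch, where `years` is made-up data:

```python
import numpy as np
import pandas as pd

# Look up the limits of each signed integer type.
ranges = {
    np.dtype(t).name: (np.iinfo(t).min, np.iinfo(t).max)
    for t in (np.int8, np.int16, np.int32, np.int64)
}
print(ranges)

# Hypothetical year values: too large for int8, small enough for int16.
years = pd.Series([2013, 2014, 2015])
downcast = pd.to_numeric(years, downcast='integer')
print(downcast.dtype)
```

Letting pandas choose the type avoids the silent overflow that a forced `astype('int8')` can cause on values outside -128 to 127.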

In [None]:
sales.describe()

In [None]:
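# int8 only holds -128 to 127, so values such as Year (e.g. 2013) would overflow.
# Let pandas pick the smallest integer type that fits each column instead.
int_cols = ['Day','Month','Year','Customer_Age','Order_Quantity','Unit_Price','Profit','Cost']
sales[int_cols] = sales[int_cols].apply(pd.to_numeric, downcast='integer')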
sales.info()

# Joining Data

In [None]:
census_state = census_county.groupby('state_name').agg({'population':'sum'}).reset_index()

In [None]:
census_state = pd.merge(census_wavg,
                        census_state,
                           how="left",
                           left_on = ["state_name"],
                           right_on = ["state_name"])
census_state
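
When joining, it can help to check which rows actually matched. The sketch below uses two small made-up frames (`left` and `right`) with the `indicator` and `validate` options of `merge`.

```python
import pandas as pd

left = pd.DataFrame({
    'state_name': ['Arizona', 'Nevada', 'Utah'],
    'median_income': [1, 2, 3],
})
right = pd.DataFrame({
    'state_name': ['Arizona', 'Nevada'],
    'population': [10, 20],
})

merged = left.merge(
    right,
    how='left',
    on='state_name',
    indicator=True,         # adds a _merge column describing each row's match
    validate='one_to_one',  # raises if either side has duplicate keys
)
print(merged)
```

Rows marked `left_only` found no partner in the right table, so their right-hand columns are null.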

# Case When np.where

## Single Condition

In [None]:
census_state["is_arizona"] = np.where(census_state["state_name"] == 'Arizona', True, False).copy()
census_state

## Using a List

In [None]:
geo_list =  ['Arizona','Nevada','New Mexico','Utah','Texas','Colorado']
census_state["sun_belt"] = np.where(census_state["state_name"].isin(geo_list), True, False).copy()
census_state

## Multiple Conditions

In [None]:
census_state["sun_belt"] = np.where((census_state["state_name"] == 'Arizona') |
                                    (census_state["state_name"] == 'Nevada') |
                                    (census_state["state_name"] == 'New Mexico'), True, False).copy()
census_state
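
When there are more than two outcomes, `np.select` reads closer to a SQL `CASE WHEN` than nested `np.where` calls. This is a sketch with made-up region labels: `states`, `'Southwest'`, `'South'`, and `'Other'` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

states = pd.DataFrame({'state_name': ['Arizona', 'Texas', 'Ohio', 'Nevada']})

# Conditions are checked in order; the first match wins.
conditions = [
    states['state_name'].isin(['Arizona', 'Nevada', 'New Mexico']),
    states['state_name'] == 'Texas',
]
choices = ['Southwest', 'South']

states['region'] = np.select(conditions, choices, default='Other')
print(states)
```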

## Creating A Progress Bar
`tqdm` wraps an iterable, so it shows progress for loops.

In [None]:
from tqdm import tqdm
import time

In [None]:
for i in tqdm(range(50)):
    time.sleep(1)

In [None]:
from tqdm.notebook import tqdm

In [None]:
for i in tqdm(range(50)):
    time.sleep(1)