<a href="https://colab.research.google.com/github/davidwan08/Data-Exploration-Modeling-and-Analysis-Project/blob/main/Data_Exploration_and_Modeling_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Task**

Your second project is going to have a lot more freedom than your first project. This is because we want you to have a project in your portfolio that interests you or relates to the industry you would like to work in.

Your task for this week is to propose two possible datasets you would like to work with for Project 2.  

You will choose your first choice data set, and a backup data set in case the first proposed data set is not approved.  

This data can be from any source and can be on any topic with these limitations:

- The data must be available for use (it is your responsibility to ensure that the license states that you are able to use it).
- The data must be appropriate for a professional environment.
- The data must NOT contain personal information.
- The data must NOT be a dataset used for any assignment, lecture, or task from the course.

Make sure you select a dataset that will be reasonable to work with in the amount of time we have left. Think about what questions you could reasonably answer with the dataset you select.

You must propose two datasets that each have a supervised learning component. You may choose a regression or classification problem for each proposed data set.  

For this task:

Create a Colab notebook where you have uploaded and shown the .head() of each of your data sets.  For each of the proposed datasets, answer the following questions:

First choice: dataset 1

1) Source of data

2) Brief description of data

3) What is the target?

4) Is this a classification or regression problem?

5) How many features?

6) How many rows of data?

7) What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?

Second choice: dataset 2

1) Source of data

2) Brief description of data

3) What is the target?

4) Is this a classification or regression problem?

5) How many features?

6) How many rows of data?

7) What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?

Add your Colab notebook to your GitHub and submit the link below.

# **Data and Library Preparation**

In [1]:
# Imports the necessary libraries.
import pandas as pd

In [2]:
# Mounting of data from Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Dataset 1: Mushroom Classification (Primary)**

## Data Overview

Source: https://www.kaggle.com/datasets/uciml/mushroom-classification

### Context
Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?

### Content
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

- Time period: Donated to UCI ML 27 April 1987

### Inspiration
- What types of machine learning models perform best on this dataset?
- Which features are most indicative of a poisonous mushroom?

## Mushroom Data Dictionary

Attribute Information: (classes: edible=e, poisonous=p)

- cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
- cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
- cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
- bruises: bruises=t,no=f
- odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
- gill-attachment: attached=a,descending=d,free=f,notched=n
- gill-spacing: close=c,crowded=w,distant=d
- gill-size: broad=b,narrow=n
- gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
- stalk-shape: enlarging=e,tapering=t
- stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
- stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- veil-type: partial=p,universal=u
- veil-color: brown=n,orange=o,white=w,yellow=y
- ring-number: none=n,one=o,two=t
- ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
- spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
- population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
- habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
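
The data dictionary notes that missing 'stalk-root' entries are coded as `?` rather than left blank, so pandas will not treat them as missing by default. A minimal sketch of one way to handle this at load time is shown below. The small in-memory CSV is a hypothetical stand-in for `mushrooms.csv` (its rows are made up for illustration), and passing `na_values="?"` is an assumption about how one might choose to load the file, not what this notebook does.

```python
import io

import pandas as pd

# Toy stand-in for mushrooms.csv (hypothetical rows, not real data).
toy_csv = io.StringIO(
    "class,odor,stalk-root\n"
    "p,p,e\n"
    "e,a,c\n"
    "e,n,?\n"
    "p,f,?\n"
)

# Treat the '?' placeholder as a missing value while reading the file.
toy_df = pd.read_csv(toy_csv, na_values="?")

# isna() now counts the placeholder entries as missing.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

With the real file, the same `na_values` argument could be added to the `pd.read_csv` call so that the later missing-value check reflects the placeholder.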

## Data Loading and Exploration

- Let's examine the dataset in depth.

In [3]:
# Load the mushroom dataset into Python.
mushroom_file = '/content/drive/MyDrive/Coding Dojo/Machine Learning/mushrooms.csv'
mushrooms_df = pd.read_csv(mushroom_file)
mushrooms_df.head()

,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [5]:
# Get an overview of the mushroom dataset.
mushrooms_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [7]:
# Check for duplicated entries in the dataset.
mushrooms_df.duplicated().sum()

0

In [6]:
# Check for missing values in each column.
mushrooms_df.isna().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

In [14]:
# Check the number of rows and columns in this dataset.
mushrooms_df.shape

(8124, 23)

- So far, the checks report no duplicated entries or missing values in the dataset. Note, however, that the data dictionary marks missing 'stalk-root' entries with '?', which isna() does not count.
- The target variable is the classification of the mushroom as either edible or poisonous ('class' column). Therefore, this will be a classification problem.
- There are 8124 entries and 22 features in this dataset.
- I do not foresee major data cleaning issues involving imputation or datatypes. However, with this many features, there is potential for multicollinearity as we perform our modeling and classification predictions.
- Also, some columns may have to be encoded as either nominal or ordinal features. This will have to be addressed prior to validation splitting.
- All in all, this is a very interesting dataset that intrigues me a lot.
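
The points above about the target, the all-categorical features, and possible collinearity can be sketched on a small sample. The toy frame below copies four columns from the five rows shown by `.head()`; the choice of columns, the 0/1 target mapping, and the use of `pd.get_dummies` with `drop_first=True` are assumptions made for illustration, not the method used later in the project.

```python
import pandas as pd

# Four columns taken from the five rows displayed by .head() above.
toy_df = pd.DataFrame(
    {
        "class": ["p", "e", "e", "p", "e"],
        "cap-shape": ["x", "x", "b", "x", "x"],
        "odor": ["p", "a", "l", "p", "n"],
        "veil-type": ["p", "p", "p", "p", "p"],
    }
)

# Separate the target from the features and map it to 0/1 (poisonous = 1).
y = toy_df["class"].map({"e": 0, "p": 1})
X = toy_df.drop(columns="class")

# Columns with a single unique value carry no information for a model.
constant_cols = [col for col in X.columns if X[col].nunique() == 1]
X = X.drop(columns=constant_cols)

# One-hot encode the nominal features. drop_first removes one dummy per
# feature, which avoids perfectly collinear dummy columns.
X_encoded = pd.get_dummies(X, drop_first=True)

print(constant_cols)
print(X_encoded.columns.tolist())
```

On the full dataset the same steps would apply to `mushrooms_df`; whether any column is actually constant there should be checked rather than assumed from five rows.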

# **Dataset 2: Legendary Pokemon Classification (Backup)**

## Data Overview

Source: https://www.kaggle.com/datasets/rounakbanik/pokemon?select=pokemon.csv

## About Dataset

### Context
This dataset contains information on all 802 Pokemon from all Seven Generations of Pokemon. The information contained in this dataset includes Base Stats, Performance against Other Types, Height, Weight, Classification, Egg Steps, Experience Points, Abilities, etc. The information was scraped from http://serebii.net/


### Acknowledgements
The data was scraped from http://serebii.net/.

### Inspiration
Pokemon holds a very special place in my heart as it is probably the only video game I have judiciously followed for more than 10 years. With this dataset, I wanted to be able to answer the following questions:

- Is it possible to build a classifier to identify legendary Pokemon?
- How do the height and weight of a Pokemon correlate with its various base stats?
- What factors influence the Experience Growth and Egg Steps? Are these quantities correlated?
- Which type is the strongest overall? Which is the weakest?
- Which type is the most likely to be a legendary Pokemon?
- Can you build a Pokemon dream team? A team of 6 Pokemon that inflicts the most damage while remaining relatively impervious to any other team of 6 Pokemon.

## Pokemon Data Dictionary

- name: The English name of the Pokemon
- japanese_name: The Original Japanese name of the Pokemon
- pokedex_number: The entry number of the Pokemon in the National Pokedex
- percentage_male: The percentage of the species that are male. Blank if the Pokemon is genderless.
- type1: The Primary Type of the Pokemon
- type2: The Secondary Type of the Pokemon
- classification: The Classification of the Pokemon as described by the Sun and Moon Pokedex (the column is spelled 'classfication' in the CSV)
- height_m: Height of the Pokemon in metres
- weight_kg: The Weight of the Pokemon in kilograms
- capture_rate: Capture Rate of the Pokemon
- base_egg_steps: The number of steps required to hatch an egg of the Pokemon
- abilities: A stringified list of abilities that the Pokemon is capable of having
- experience_growth: The Experience Growth of the Pokemon
- base_happiness: Base Happiness of the Pokemon
- against_?: Eighteen features that denote the amount of damage taken against an attack of a particular type
- hp: The Base HP of the Pokemon
- attack: The Base Attack of the Pokemon
- defense: The Base Defense of the Pokemon
- sp_attack: The Base Special Attack of the Pokemon
- sp_defense: The Base Special Defense of the Pokemon
- speed: The Base Speed of the Pokemon
- generation: The numbered generation which the Pokemon was first introduced
- is_legendary: Denotes if the Pokemon is legendary.
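
Because 'abilities' is stored as a stringified list, each cell is a plain string until it is parsed. A minimal sketch of one way to recover the lists is shown below, using `ast.literal_eval` from the standard library. The two sample strings are copied from the 'abilities' values displayed by `.head()`; the new column names `abilities_list` and `n_abilities` are hypothetical.

```python
import ast

import pandas as pd

# Two 'abilities' strings as they appear in the dataset's .head() output.
toy_df = pd.DataFrame(
    {"abilities": ["['Overgrow', 'Chlorophyll']", "['Blaze', 'Solar Power']"]}
)

# literal_eval safely parses a Python literal without executing code.
toy_df["abilities_list"] = toy_df["abilities"].apply(ast.literal_eval)

# A simple numeric feature derived from the parsed list.
toy_df["n_abilities"] = toy_df["abilities_list"].apply(len)

print(toy_df[["abilities_list", "n_abilities"]])
```

`ast.literal_eval` is preferred here over `eval` because it only accepts literals such as lists and strings.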

## Data Loading and Exploration

- Let's examine the dataset in depth.

In [16]:
# Load the pokemon dataset into Python.
pokemon_file = '/content/drive/MyDrive/Coding Dojo/Machine Learning/pokemon.csv'
pokemon_df = pd.read_csv(pokemon_file)
pokemon_df.head()

,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,2,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,3,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0


In [17]:
# Get an overview of the pokemon dataset.
pokemon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          801 non-null    object 
 1   against_bug        801 non-null    float64
 2   against_dark       801 non-null    float64
 3   against_dragon     801 non-null    float64
 4   against_electric   801 non-null    float64
 5   against_fairy      801 non-null    float64
 6   against_fight      801 non-null    float64
 7   against_fire       801 non-null    float64
 8   against_flying     801 non-null    float64
 9   against_ghost      801 non-null    float64
 10  against_grass      801 non-null    float64
 11  against_ground     801 non-null    float64
 12  against_ice        801 non-null    float64
 13  against_normal     801 non-null    float64
 14  against_poison     801 non-null    float64
 15  against_psychic    801 non-null    float64
 16  against_rock       801 non

In [18]:
# Check for duplicated entries in the dataset.
pokemon_df.duplicated().sum()

0

In [19]:
# Check for missing values in each column.
pokemon_df.isna().sum()

abilities              0
against_bug            0
against_dark           0
against_dragon         0
against_electric       0
against_fairy          0
against_fight          0
against_fire           0
against_flying         0
against_ghost          0
against_grass          0
against_ground         0
against_ice            0
against_normal         0
against_poison         0
against_psychic        0
against_rock           0
against_steel          0
against_water          0
attack                 0
base_egg_steps         0
base_happiness         0
base_total             0
capture_rate           0
classfication          0
defense                0
experience_growth      0
height_m              20
hp                     0
japanese_name          0
name                   0
percentage_male       98
pokedex_number         0
sp_attack              0
sp_defense             0
speed                  0
type1                  0
type2                384
weight_kg             20
generation             0


In [20]:
# Check the number of rows and columns in this dataset.
pokemon_df.shape

(801, 41)

- So far, no duplicated entries in the dataset.
- Several columns have missing values: 'height_m' (20), 'weight_kg' (20), 'percentage_male' (98), and 'type2' (384).
- The target variable is the classification of the pokemon as either legendary or non-legendary ('is_legendary' column). Therefore, this will be a classification problem.
- There are 801 entries and 40 features in this dataset.
- Data cleaning is likely to be an issue with this dataset. Some of the columns contain information that does not help with the modeling process.
- In addition, there are a lot of features in this dataset, so there may be a potential for multicollinearity as we perform our modeling and classification predictions.
- Based on my knowledge of pokemon, several columns that are stored as numbers actually hold ordinal or nominal information. This will have to be addressed prior to validation splitting.
- All in all, this is also a very interesting dataset that intrigues me greatly. However, this dataset is definitely messier than the previous dataset and will require a lot more cleaning and organization to prepare it for modeling and analysis.
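
Two of the columns with missing values could be handled as sketched below. The data dictionary states that 'percentage_male' is blank for genderless Pokemon, so the blank itself is informative. Treating a blank 'type2' as "no secondary type" is an assumption. The toy frame is illustrative: its first two rows follow the `.head()` output, the third row is hypothetical, and the `is_genderless` column and the "none" label are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 1-2 follow the .head() output, row 3 is hypothetical.
toy_df = pd.DataFrame(
    {
        "type1": ["grass", "fire", "psychic"],
        "type2": ["poison", np.nan, np.nan],
        "percentage_male": [88.1, 88.1, np.nan],
        "is_legendary": [0, 0, 1],
    }
)

# Assumption: a blank type2 means the Pokemon has no secondary type.
toy_df["type2"] = toy_df["type2"].fillna("none")

# Per the data dictionary, a blank percentage_male means genderless,
# so keep that information as an explicit 0/1 flag.
toy_df["is_genderless"] = toy_df["percentage_male"].isna().astype(int)

# Share of positive rows in the target, a quick class-balance check.
legendary_share = toy_df["is_legendary"].mean()

print(toy_df)
```

On the real data, the same steps would run on `pokemon_df`, and the class-balance check would show how rare the legendary class is before a metric is chosen.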