<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Black;
           font-size:210%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:center;"
          >
       WELCOME TO MY NOTEBOOK
</p>
</div>

**Dataset: Possum**

The dataset, called **'possum,'** contains nine morphometric measurements for 104 mountain brushtail possums. These possums were captured at seven different locations spanning from Southern Victoria to central Queensland.

![](https://media2.giphy.com/media/mCDJrksAk4dINqtNei/giphy.gif)

In this notebook we are going to predict the **head length of a possum**. The dataset has 14 variables, listed below:
1. Case
2. Site
3. Chest
4. Footlength
5. Skullwidth
6. Belly
7. Age
8. Headlength
9. Totallength
10. Eye
11. Earconch
12. Tail
13. Sex
14. Population

# Import all the Libraries

In [None]:
# import all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,VotingRegressor



# Read the Dataset

In [None]:
# Read the dataset
dataframe= pd.read_csv("/kaggle/input/openintro-possum/possum.csv")
dataframe.head(5)

In [None]:
# Let's check the shape of the dataset
dataframe.shape

In [None]:
# Let's check whether there are any null values in the dataset
dataframe.isna().sum()

> There are two null values in the `age` column and one null value in the `footlgth` column.

In [None]:
# Drop the case column because it is only a row ID
dataframe.drop(["case"], inplace=True, axis=1)

In [None]:
# Let's check the summary statistics of the data
dataframe.describe()

# Getting Categorical and Numerical Columns

In [None]:
# Getting Categorical and numerical columns
categorical_columns=dataframe.select_dtypes(include="object").columns
numerical_columns=dataframe.select_dtypes(exclude="object").columns

In [None]:
print(f"categorical_columns: {categorical_columns}")
print(f"numerical_columns: {numerical_columns}")

# Let's Do the Exploratory Data Analysis

In [None]:
df=dataframe.drop(["Pop","sex"],axis=1)
df.corr().style.background_gradient(cmap='coolwarm')

> 1. Footlength and Earconch have a correlation of 0.78.
> 2. Headlength and Skullwidth have a correlation of 0.71.
> 3. Chest has a correlation of 0.63 with both Headlength and Skullwidth.
> 4. Chest and Belly have a correlation of 0.61.
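
The strongest pairs can also be pulled out of a correlation matrix programmatically instead of being read off the heatmap. The following is a minimal sketch, assuming a hypothetical helper `top_correlations` (not part of the original notebook) and a toy frame with made-up values; on the notebook's data the call would be `top_correlations(df)`.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n strongest pairwise correlations by absolute value, each pair listed once."""
    corr = frame.corr()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations are not pushed to the bottom
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy frame with made-up values, used only to show the call
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.2, 7.9, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
result = top_correlations(toy, n=2)
print(result)
```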

# Univariate Analysis

# Let's See the Distribution of Numerical Columns

In [None]:
colors=["red","blue", "green","orange","black","purple", "brown","pink","red","blue", "green"]

sns.set(style="darkgrid")

# One histogram (with a KDE curve) per numerical column
for col, color in zip(numerical_columns, colors):
    plt.figure(figsize=(5,5))
    sns.histplot(data=dataframe, x=col, kde=True, color=color)
    plt.title(f"Distribution of {col}")
    plt.show()

# Let's Have a Look at the Categorical Columns

In [None]:
dataframe["Pop"].unique()

> The Population either belongs to Vic (Victoria) or other (New South Wales or Queensland).

In [None]:
dataframe["sex"].unique()

In [None]:
colors=["red","blue"]

# One count plot per categorical column
for col, color in zip(categorical_columns, colors):
    plt.figure(figsize=(5,5))
    sns.countplot(data=dataframe, x=col, color=color)
    plt.title(f"Countplot of {col}")
    plt.show()
    

> 1. Here we can see that about 60% of the possums are male and about 40% are female.
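
Shares like these can be checked numerically with `value_counts(normalize=True)` rather than estimated from the bar heights. A small sketch on a made-up series (the values are illustrative and not taken from the possum data); on the notebook's data the equivalent call would be `dataframe["sex"].value_counts(normalize=True)`.

```python
import pandas as pd

# Made-up sample, used only to illustrate the call
toy_sex = pd.Series(["m", "m", "m", "f", "f"], name="sex")

# normalize=True returns fractions instead of raw counts; scale to percentages
shares = toy_sex.value_counts(normalize=True).mul(100).round(1)
print(shares)
```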

# Bivariate Analysis

In [None]:
fig = px.scatter(dataframe, x="hdlngth", y="age", color="age",trendline="ols", title="Headlength vs Age")
fig.show()

In [None]:
fig = px.scatter(dataframe, x="hdlngth", y="skullw", size="skullw",color="skullw",trendline="ols", title="Headlength vs Skullwidth")
fig.show()

In [None]:
fig = px.scatter(dataframe, x="footlgth", y="earconch", color="earconch", trendline="ols",title="Footlength vs Earconch")
fig.show()

In [None]:
fig = px.scatter(dataframe, x="hdlngth", y="chest", color="chest", trendline="ols", title="Headlength vs Chest")
fig.show()

In [None]:
fig = px.scatter(dataframe, x="belly", y="chest", color="chest", trendline="ols", title="Belly vs Chest")
fig.show()

In [None]:
fig = px.scatter(dataframe, x="skullw", y="chest", color="chest", trendline="ols", title="Skullwidth vs Chest")
fig.show()

> # In the bivariate analysis, we see positive correlations between the variables plotted above, which match the correlation matrix.

In [None]:
fig = px.box(dataframe, x="sex", y="hdlngth", points="all", color="sex", title="Headlength vs Sex")
fig.show()


> # Here we see that male possums tend to have longer heads than female possums.
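
A box plot comparison like this can be backed with per-group summary statistics. The sketch below uses a toy frame with made-up head lengths (not values from the possum data) and only shows the shape of the call; on the notebook's data it would be `dataframe.groupby("sex")["hdlngth"].agg(["mean", "median", "count"])`.

```python
import pandas as pd

# Made-up values, used only to illustrate the groupby call
toy = pd.DataFrame({
    "sex": ["m", "m", "f", "f"],
    "hdlngth": [94.0, 92.0, 91.0, 90.0],
})

# Mean, median and group size of head length for each sex
summary = toy.groupby("sex")["hdlngth"].agg(["mean", "median", "count"])
print(summary)
```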

In [None]:
fig = px.box(dataframe, x="Pop", y="hdlngth", points="all", color="sex", title="Headlength vs Population")
fig.show()


> # Here we observe that more possums of both sexes belong to the other (New South Wales or Queensland) population than to Victoria.
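
Counts per population and sex can be tabulated directly with `pd.crosstab`, which is easier to read than counting points in the box plot. A minimal sketch on a made-up frame (the rows are illustrative only); on the notebook's data the call would be `pd.crosstab(dataframe["Pop"], dataframe["sex"])`.

```python
import pandas as pd

# Made-up rows, used only to illustrate the call
toy = pd.DataFrame({
    "Pop": ["Vic", "Vic", "other", "other", "other"],
    "sex": ["m", "f", "m", "m", "f"],
})

# Rows are populations, columns are sexes, cells are counts
counts = pd.crosstab(toy["Pop"], toy["sex"])
print(counts)
```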

# Handling Missing Values in the Dataset

In [None]:
dataframe["age"]= dataframe["age"].fillna(dataframe["age"].median())
dataframe["footlgth"]= dataframe["footlgth"].fillna(dataframe["footlgth"].median())
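
To make the effect of median imputation concrete, here is a small sketch on a made-up series (the values are not from the possum data). The median is computed over the non-missing entries only, and every missing entry receives that same value.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries
toy_age = pd.Series([2.0, np.nan, 4.0, 6.0, np.nan], name="age")

# median() skips NaN by default, so it is computed from 2, 4 and 6
filled = toy_age.fillna(toy_age.median())
print(filled.tolist())
```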

# Divide Dataset into Train and Test Set

In [None]:
length= len(dataframe)
train_data=dataframe.iloc[: int(length * 0.7)]
test_data=dataframe.iloc[int(length * 0.7): ]
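
The split above is positional: the first 70% of rows go to training and the rest to testing. If the rows in the file happen to be ordered (for example by site), a shuffled split may give train and test sets that look more alike. The following is only a sketch of that alternative using scikit-learn's `train_test_split` on a made-up frame; the `random_state` value is an arbitrary assumption, and this is not the split used in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with ten rows, ordered by a hypothetical "site" column
toy = pd.DataFrame({
    "site": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "hdlngth": [90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0],
})

# shuffle=True (the default) mixes the rows before splitting 70/30
train_part, test_part = train_test_split(toy, test_size=0.3, random_state=42, shuffle=True)
print(train_part.shape, test_part.shape)
```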

In [None]:
train_data.shape

In [None]:
test_data.shape

# Detecting Outliers in the Dataset

In [None]:
def Percentile_Method(columns, dataframe, a, b):
    """Drop rows that fall outside the [a, b] percentile range in any of the given columns."""
    outliers = set()   # positional indices of outlier rows; a set avoids duplicates

    for col in columns:
        lower = np.percentile(dataframe[col], a)
        upper = np.percentile(dataframe[col], b)

        for pos in range(len(dataframe)):
            value = dataframe[col].iloc[pos]
            if value > upper or value < lower:
                outliers.add(pos)

    outliers = sorted(outliers)
    ratio = round(len(outliers) / len(dataframe) * 100, 2)   # percentage of rows flagged as outliers
    # Return a new frame rather than mutating the caller's slice in place
    dataframe = dataframe.drop(dataframe.index[outliers])

    return ratio, dataframe

In [None]:
ratio,train_data= Percentile_Method(numerical_columns,train_data, a=0.3, b=99.8)
print(f"Ratio of Detected Outliers: {ratio}%")
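
The same percentile rule can be written without the inner Python loop. The sketch below is a vectorized variant under the assumption that pandas' `quantile` (linear interpolation by default) is an acceptable stand-in for `np.percentile`; the helper name `percentile_filter` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def percentile_filter(frame, columns, lower_pct, upper_pct):
    """Keep rows whose values lie within the percentile range in every listed column."""
    lower = frame[columns].quantile(lower_pct / 100)
    upper = frame[columns].quantile(upper_pct / 100)
    # Compare each column against its own thresholds, then require all columns to pass
    inside = frame[columns].ge(lower, axis=1) & frame[columns].le(upper, axis=1)
    keep = inside.all(axis=1)
    ratio = round((~keep).mean() * 100, 2)   # percentage of rows removed
    return ratio, frame.loc[keep]

# Made-up values: the last row is extreme in both columns
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [10.0, 11.0, 12.0, 13.0, 14.0],
})
toy_ratio, cleaned = percentile_filter(toy, ["x", "y"], 0, 90)
print(toy_ratio, len(cleaned))
```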

In [None]:
train_data.shape

# Data Preprocessing

In [None]:
x_train=train_data.drop("hdlngth", axis=1)
y_train=train_data["hdlngth"]

x_test=test_data.drop("hdlngth", axis=1)
y_test=test_data["hdlngth"]

In [None]:
x_train.shape, y_train.shape,  x_test.shape, y_test.shape

# Label Encoding

In [None]:
# Label Encoding
le= LabelEncoder()
for col in categorical_columns:
    x_train[col]= le.fit_transform(x_train[col])
    x_test[col]=le.transform(x_test[col])
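
For reference, `LabelEncoder` sorts the distinct labels it sees during fitting and maps each one to its position in that sorted order. A small sketch on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# fit_transform learns the classes and encodes in one step
encoded = encoder.fit_transform(["m", "f", "m"])

print(encoder.classes_)   # the learned classes, in sorted order
print(encoded)            # integer codes referring to positions in classes_
```

Calling `transform` on a label that was not seen during fitting raises an error, which is why the encoder is fitted on the training column first.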

# Normalization

In [None]:
# Let's standardize the features to a common scale
numerical_columns = ['site', 'age', 'skullw', 'totlngth', 'taill', 'footlgth', 'earconch', 'eye', 'chest', 'belly']

std= StandardScaler()
x_train[numerical_columns]= std.fit_transform(x_train[numerical_columns])
x_test[numerical_columns]= std.transform(x_test[numerical_columns])
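
The difference between `fit_transform` on the training set and `transform` on the test set is that the mean and standard deviation are learned from the training data only and then reused. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train, then scales it
test_scaled = scaler.transform(test)         # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean zero, while the test value is expressed in training standard deviations, so it does not have to fall in the same range.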

# Modelling

In [None]:
# Define the Hyperparameters
hyper_rf= {'n_estimators':[100,200,300,400], 'max_depth':[5,10,11], 'min_samples_split':[2,3,4], 'criterion':['squared_error'], 'n_jobs':[-1]}

hyper_gbr= {"n_estimators":[500,600],
          "learning_rate":[0.01,0.001,0.1],
          "max_depth":[3,4],
          "max_features":['sqrt'],
          "min_samples_leaf":[10,12,15],
          "min_samples_split":[8,10],
          }


# Create the Models
rf_model= RandomForestRegressor()
gbr_model = GradientBoostingRegressor() 

models=[rf_model,gbr_model]
parameters=[hyper_rf,hyper_gbr]



rmse=[]
r2=[]
for i in range(len(models)):
    model= GridSearchCV(models[i], parameters[i], cv=5, scoring="r2", n_jobs=-1)
    model.fit(x_train,y_train)
    y_preds=model.predict(x_test)
    print(model.best_estimator_)
    print("---------------------------------------------------------------")
    rmse.append(np.sqrt(mean_squared_error(y_test, y_preds)))
    r2.append(r2_score(y_test, y_preds))
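
Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_` and `best_score_`. The latter is the mean cross-validated score on the training folds, so it is a different number from a test-set score such as the ones collected in `r2` above. The sketch below uses synthetic data from `make_regression` and a deliberately tiny grid so that it runs quickly; neither is taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to make the sketch runnable
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)   # the winning combination from the grid
print(search.best_score_)    # mean cross-validated R2 of that combination
```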

# Results

In [None]:
model_names = ['RandomForest','GradientBoost']
result_df = pd.DataFrame({'RMSE':rmse,'R2_score': r2}, index=model_names)
result_df

In [None]:
result_df["RMSE"].plot(kind="barh", figsize=(9, 6), color="blue").legend(bbox_to_anchor=(1.0, 1.0))