# Exploratory Data Analysis (EDA) Outline

## 1. Introduction
- **Objective**: Briefly explain the objective of the EDA.
- **Data Source**: Describe where the data comes from.

## 2. Data Loading and Initial Exploration
- [ ] Load the data using the appropriate libraries/tools.
- [ ] Check the shape (number of rows and columns) of the dataset.
- [ ] Display the first few rows of the dataset.
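
The three steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's own cell: the inline CSV text and the name `sample` are assumptions that stand in for a real file such as `train.csv`, and the values are made up.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file;
# the values are illustrative only.
csv_text = """PassengerId,HomePlanet,Age,Transported
0001_01,Europa,39.0,False
0002_01,Earth,24.0,True
0003_01,Europa,,False
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape  # (number of rows, number of columns)
first_rows = sample.head(2)    # first two rows of the dataset

print(n_rows, n_cols)
print(first_rows)
```

With a file on disk, the `io.StringIO(...)` argument would simply be replaced by the file path.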

## 3. Data Cleaning
- [ ] Check for missing values.
    - [ ] Decide on a strategy to handle missing values (e.g., imputation, deletion).
- [ ] Check for duplicate rows.
    - [ ] Remove duplicates if any.
- [ ] Identify and handle outliers.
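
A minimal sketch of these cleaning checks, assuming a small made-up frame named `data`. The 1.5 × IQR rule used to flag outliers is one common convention, chosen here for illustration; the outline does not prescribe a specific method.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing Age, one fully duplicated row, one extreme Spa value.
data = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 24.0, 33.0],
    "Spa": [0.0, 549.0, 10.0, 549.0, 6715.0],
})

# Missing values per column.
missing_per_column = data.isna().sum()

# Rows that exactly repeat an earlier row, then the frame without them.
n_duplicates = int(data.duplicated().sum())
deduped = data.drop_duplicates()

# Flag outliers in 'Spa' with the 1.5 * IQR rule (an assumed convention).
q1, q3 = deduped["Spa"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (deduped["Spa"] < q1 - 1.5 * iqr) | (deduped["Spa"] > q3 + 1.5 * iqr)

print(missing_per_column)
print(n_duplicates)
print(deduped[is_outlier])
```

Whether flagged rows are dropped, capped, or kept depends on the analysis goal, so the sketch only identifies them.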

## 4. Data Type Conversion
- [ ] Check the data types of each column.
- [ ] Convert data types if necessary (e.g., string to datetime or float to integer).
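
The conversions mentioned above might look like the sketch below. The frame `raw` and its `BoardingDate` column are hypothetical and exist only to show a string-to-datetime conversion.

```python
import pandas as pd

# Hypothetical frame; 'BoardingDate' is an invented column for illustration.
raw = pd.DataFrame({
    "BoardingDate": ["2023-01-05", "2023-02-10"],
    "Age": [39.0, 24.0],
    "HomePlanet": ["Europa", "Earth"],
})

# Inspect the current types before converting.
print(raw.dtypes)

converted = raw.assign(
    BoardingDate=pd.to_datetime(raw["BoardingDate"]),  # string -> datetime
    Age=raw["Age"].astype("int8"),                     # float -> small integer
    HomePlanet=raw["HomePlanet"].astype("category"),   # string -> category
)

print(converted.dtypes)
```

Casting a float column to an integer type raises an error if it still contains missing values, so missing values need to be filled first.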

## 5. Descriptive Statistics
- [ ] Generate summary statistics for numerical columns (mean, median, standard deviation, etc.).
- [ ] Generate counts/frequencies for categorical columns.
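
As a small sketch of both steps, the snippet below uses the `Age` and `HomePlanet` values of the five rows shown in the `df.head()` output of this notebook. The frame name `people` is an assumption.

```python
import pandas as pd

# Age and HomePlanet values taken from the five rows shown by df.head().
people = pd.DataFrame({
    "Age": [39, 24, 58, 33, 16],
    "HomePlanet": ["Europa", "Earth", "Europa", "Europa", "Earth"],
})

# Summary statistics for a numerical column (count, mean, std, quartiles, ...).
numeric_summary = people["Age"].describe()
median_age = people["Age"].median()

# Frequencies for a categorical column.
planet_counts = people["HomePlanet"].value_counts()

print(numeric_summary)
print(median_age)
print(planet_counts)
```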

## 6. Data Visualization
### 6.1 Univariate Analysis
- [ ] Plot histograms/bar charts for individual columns.
- [ ] Generate box plots for numerical columns to inspect outliers.
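
A minimal matplotlib sketch of a histogram and a box plot for one numerical column. The series `ages` reuses the five `Age` values from the `df.head()` output, and the `Agg` backend line is an assumption that lets the sketch run without a display; it can be dropped inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import matplotlib.pyplot as plt
import pandas as pd

# Five Age values from the df.head() output, used purely for illustration.
ages = pd.Series([39, 24, 58, 33, 16], name="Age")

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: distribution of a single numerical column.
ax_hist.hist(ages.to_numpy(), bins=5)
ax_hist.set_title("Age distribution")
ax_hist.set_xlabel("Age")

# Box plot: median, quartiles, and points beyond the whiskers.
ax_box.boxplot(ages.to_numpy())
ax_box.set_title("Age box plot")

fig.tight_layout()
```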

### 6.2 Bivariate Analysis
- [ ] Generate scatter plots to study relationships between numerical columns.
- [ ] Generate stacked bar charts/cross tables for categorical columns.
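
For the categorical case, a cross table can be built with `pd.crosstab`. The sketch below uses the `HomePlanet` and `Transported` values of the five rows shown by `df.head()`; the frame name `trips` is an assumption.

```python
import pandas as pd

# HomePlanet and Transported values from the five rows shown by df.head().
trips = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Europa", "Europa", "Earth"],
    "Transported": [False, True, False, False, True],
})

# Absolute counts for every (HomePlanet, Transported) combination.
table = pd.crosstab(trips["HomePlanet"], trips["Transported"])

# Row-wise shares: each HomePlanet row sums to 1.
row_share = pd.crosstab(trips["HomePlanet"], trips["Transported"], normalize="index")

print(table)
print(row_share)
```

Calling `table.plot(kind="bar", stacked=True)` on the result would draw the corresponding stacked bar chart.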

### 6.3 Correlation Analysis
- [ ] Compute correlation matrix for numerical columns.
- [ ] Visualize the correlation matrix using a heatmap.
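
The correlation matrix itself is a single pandas call. The sketch below uses the `FoodCourt`, `Spa`, and `VRDeck` values from the five rows shown by `df.head()`; the frame name `spend` is an assumption, and five rows are far too few for a meaningful estimate.

```python
import pandas as pd

# Spending values from the five rows shown by df.head().
spend = pd.DataFrame({
    "FoodCourt": [0.0, 9.0, 3576.0, 1283.0, 70.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0],
    "VRDeck": [0.0, 44.0, 49.0, 193.0, 2.0],
})

# Pairwise Pearson correlation (the default method) between numerical columns.
corr = spend.corr()

print(corr.round(3))
```

The resulting square frame can be passed directly to `sns.heatmap(corr, annot=True)` to visualize it.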

## 7. Feature Engineering
- [ ] Create new features if necessary.
- [ ] Drop redundant or irrelevant features.
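
As one example of both steps, the sketch below derives a group-size feature from the prefix of `PassengerId` and then drops a column. The IDs and names come from the `df.head()` output; treating the part before `_` as a travel group follows the preprocessing cell later in this notebook, and the frame name `passengers` is an assumption.

```python
import pandas as pd

# PassengerId and Name values from the five rows shown by df.head().
passengers = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"],
    "Name": ["Maham Ofracculy", "Juanna Vines", "Altark Susent",
             "Solam Susent", "Willy Santantines"],
})

# New feature: number of passengers sharing the same ID prefix.
group_id = passengers["PassengerId"].str.split("_").str[0]
passengers["TravelGroupSize"] = group_id.map(group_id.value_counts())

# Drop a column that is not used as a model feature.
features = passengers.drop(columns=["Name"])

print(features)
```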

## 8. Conclusion and Next Steps
- Summarize the main insights from the EDA.
- Provide recommendations or plan further analysis or modeling based on the findings.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp  # newer releases of this package are published as ydata_profiling
import sklearn as sk
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold


In [5]:
df = pd.read_csv('train.csv')


# Data exploration
#pp.ProfileReport(df)
df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [3]:
# Concatenate train and test data
df_test = pd.read_csv('test.csv')
df_train = pd.read_csv('train.csv')
df = pd.concat([df_train, df_test], axis=0, ignore_index=True)
df.to_csv('train_concat.csv', index=False)


In [13]:
df = pd.read_csv('train.csv')
def preprocessing(df):

    # Passengers who share the prefix before '_' in 'PassengerId' travel together.
    # Counting each prefix once avoids rescanning the whole DataFrame for every row.
    group_id = df['PassengerId'].str.split('_').str[0]

    # Create the 'TravelGroupSize' column from the per-prefix counts
    df['TravelGroupSize'] = group_id.map(group_id.value_counts())

    # Filling missing values for HomePlanet
    df['HomePlanet'] = df['HomePlanet'].fillna('Earth')

    # Filling missing values for CryoSleep
    df['CryoSleep'] = df['CryoSleep'].fillna(False)

    # Filling missing values for Cabin
    df['Cabin'] = df['Cabin'].fillna('Unknown')
    
    # Cabin split
    mask = df['Cabin'] != "Unknown"

    # Split 'Cabin' into 'Deck', 'Num', and 'Side' columns
    split_values = df['Cabin'].str.split('/', expand=True)

    # Assign the split values to 'Deck' and 'Side' for non-"Unknown" rows
    df.loc[mask, 'Deck'] = split_values[0]
    df.loc[mask, 'Side'] = split_values[2]

    # Fill missing 'Deck' and 'Side' values (rows with an unknown cabin) with defaults
    df['Deck'] = df['Deck'].fillna("F")
    df['Side'] = df['Side'].fillna("S")

    # Filling missing values for Destination
    df['Destination'] = df['Destination'].fillna('TRAPPIST-1e')

    # Filling missing values for Age
    df['Age'] = df['Age'].fillna(df['Age'].mean())

    # Filling missing values for VIP
    df['VIP'] = df['VIP'].fillna(False)

    # Filling missing values for amenities
    df['FoodCourt'] = df['FoodCourt'].fillna(0)
    df['ShoppingMall'] = df['ShoppingMall'].fillna(0)
    df['Spa'] = df['Spa'].fillna(0)
    df['VRDeck'] = df['VRDeck'].fillna(0)
    df['RoomService'] = df['RoomService'].fillna(0)


    # Drop less important features, then set compact dtypes
    df = df.drop(columns=['Name', 'PassengerId', 'Cabin'])
    df = df.astype({'HomePlanet': 'category',
                    'CryoSleep': 'bool',
                    'Deck': 'category',
                    'Side': 'category',
                    'Destination': 'category',
                    'Age': 'int8',
                    'VIP': 'bool',
                    'FoodCourt': 'int16',
                    'ShoppingMall': 'int16',
                    'Spa': 'int16',
                    'VRDeck': 'int16',
                    'TravelGroupSize': 'int8',
                    'RoomService': 'int16',
                    'Transported': 'bool'})

    return df

# Assuming you have a DataFrame named 'df' that you want to preprocess
df = preprocessing(df)
# Change the path if needed for the concatenated file
# df.to_csv('train_preprocessed_concat.csv', index=False)
display(df.shape)
display(df.dtypes)




(8693, 14)

HomePlanet         category
CryoSleep              bool
Destination        category
Age                    int8
VIP                    bool
RoomService           int16
FoodCourt             int16
ShoppingMall          int16
Spa                   int16
VRDeck                int16
Transported            bool
TravelGroupSize        int8
Deck               category
Side               category
dtype: object

In [40]:
df = pd.read_csv('train_preprocessed.csv')
# Separate features and target variable
X = df.drop(columns=["Transported"])
y = df["Transported"]

# Define numeric and categorical features
numeric_features = ['Age', 'TravelGroupSize', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'RoomService']
categorical_features = ['HomePlanet', 'Deck', 'Side', 'Destination', 'VIP', 'CryoSleep']

# Create transformers for numeric and categorical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Use ColumnTransformer to apply transformers to the appropriate columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define a list of classifiers and their respective hyperparameter grids
classifiers = {
     
    # 0.79 on the non-concatenated data, with monetary columns
    # 'KNN': (KNeighborsClassifier(), {
    #     'classifier__n_neighbors': [30],
    #     'classifier__weights': ['uniform'],
    #     'classifier__metric': ['euclidean'],
    #     'classifier__algorithm': ['kd_tree'],
    #     'classifier__leaf_size': [5],
    # }),

    # On the non-concatenated data, with monetary columns
    # Best SVM Model: {'classifier__C': 15, 'classifier__class_weight': 'balanced', 'classifier__gamma': 0.005, 'classifier__kernel': 'rbf', 'classifier__probability': True, 'classifier__shrinking': True}
    # Best SVM Accuracy: 0.80007
    # 'SVM': (SVC(), {
    #      'classifier__C': [15],
    #      'classifier__kernel': ['rbf'],
    #      'classifier__gamma': [0.005],
    #      'classifier__probability': [True],
    #      'classifier__class_weight': ['balanced'],
    #     'classifier__shrinking': [True],
    # }),


    
    # 0.79 on the non-concatenated data, with monetary columns
    'Logistic Regression': (LogisticRegression(),
                                {'classifier__C': [10],
                                'classifier__solver': ['saga'],
                                'classifier__max_iter': [1000],
                                'classifier__class_weight': [None],
                                'classifier__tol': [1e-5]}),

   

    # Best Decision Tree Model: {'classifier__class_weight': None, 'classifier__criterion': 'entropy', 'classifier__max_depth': 14, 'classifier__max_features': None, 'classifier__min_samples_leaf': 8, 'classifier__min_samples_split': 5, 'classifier__splitter': 'random'}
    # Best Decision Tree Accuracy: 0.78039
    # 'Decision Tree': (DecisionTreeClassifier(),
    #                    {'classifier__max_depth': [14],
    #                     'classifier__min_samples_split': [5],
    #                     'classifier__min_samples_leaf': [8],
    #                     'classifier__max_features': [None],
    #                     'classifier__criterion': ['entropy'],
    #                     'classifier__splitter': ['random'],
    #                     'classifier__class_weight': [None]}),
    
    
    
    
    # On the non-concatenated data, with monetary columns
    # Best Random Forest Model: {'classifier__bootstrap': True, 'classifier__class_weight': 'balanced', 'classifier__criterion': 'gini', 'classifier__max_depth': 15, 'classifier__max_features': 'auto', 'classifier__min_samples_leaf': 5, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 200, 'classifier__n_jobs': -1, 'classifier__random_state': 42}
    # Best Random Forest Accuracy: 0.80398

    # 'Random Forest': (RandomForestClassifier(), {
    #     'classifier__n_estimators': list(range(200,250,10)),
    #     'classifier__max_depth': list(range(15,20)),
    #     'classifier__max_features': ['auto'],
    #     'classifier__min_samples_split': list(range(2,10)),
    #     'classifier__min_samples_leaf': list(range(1,10,2)),
    #     'classifier__bootstrap': [True],
    #     'classifier__criterion': ['gini'],
    #     'classifier__class_weight': ['balanced'],
    #     'classifier__random_state': [42],
    #     'classifier__n_jobs': [-1]
    # }),
    
    # On the non-concatenated data, with monetary columns
    #Best Gradient Boosting Model: {'classifier__learning_rate': 0.19, 'classifier__loss':
    # 'deviance', 'classifier__max_depth': 3, 'classifier__max_features': 'log2', 
    # 'classifier__min_samples_leaf': 8, 'classifier__min_samples_split': 8,
    # 'classifier__n_estimators': 300, 'classifier__random_state': 42,
    # 'classifier__subsample': 1.0, 'classifier__warm_start': True}
    #Best Gradient Boosting Accuracy: 0.80685
    # 'Gradient Boosting': (GradientBoostingClassifier(), 
    #                        {'classifier__learning_rate': np.arange(0.05, 0.2, 0.01),
    #                         'classifier__loss':['deviance'],
    #                         'classifier__max_depth': list(range(3, 10)),
    #                         'classifier__max_features': ['log2'], 
    #                         'classifier__min_samples_leaf': list(range(1, 10)),
    #                         'classifier__min_samples_split': list(range(2, 10)),
    #                         'classifier__n_estimators': list(range(100, 500, 25)),
    #                         'classifier__random_state': [42],
    #                         'classifier__subsample': np.arange(0.5, 1.0, 0.1),
    #                         'classifier__warm_start': [True]}),
}


# Create dictionaries to store the best models and their accuracies
best_models = {}
classifier_accuracies = {}

# Iterate through the classifiers and perform GridSearchCV
for clf_name, (clf, param_grid) in classifiers.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', clf)])
    
    stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    grid_search = GridSearchCV(pipeline, param_grid, cv=stratified_kfold, scoring='accuracy')
    grid_search.fit(X, y)
    
    best_models[clf_name] = grid_search.best_estimator_
    best_accuracy = grid_search.best_score_
    classifier_accuracies[clf_name] = best_accuracy
    print(f'Best {clf_name} Model: {grid_search.best_params_}')
    print(f'Best {clf_name} Accuracy: {grid_search.best_score_:.5f}')

# Create a bar plot to compare classifier accuracies

# import matplotlib.pyplot as plt

# # Extract classifier names and accuracy scores
# classifiers = list(classifier_accuracies.keys())
# accuracies = list(classifier_accuracies.values())

# # Create a bar plot
# plt.figure(figsize=(10, 6))
# plt.bar(classifiers, accuracies, color=['red', 'green', 'blue', 'cyan', 'yellow', 'orange', 'purple'])
# plt.xlabel('Classifier')
# plt.ylabel('Accuracy')
# plt.title('Classifier Comparison')
# plt.ylim(0, 1)  # Set the y-axis limit to the range of accuracy (0 to 1)
# plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# # Show the plot
# plt.tight_layout()
# plt.show()


# Now you can access the best models using the 'best_models' dictionary.
# For example, to use the best Random Forest Classifier:
# best_rf_model = best_models['Random Forest']

In [None]:
final_model=best_models['Gradient Boosting']
print(final_model)

In [100]:
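# This lookup needs the 'Gradient Boosting' entry in `classifiers` to have been
# enabled when the grid search ran; otherwise it raises a KeyError.
final_model = best_models['Gradient Boosting']
# GridSearchCV (refit=True by default) has already refit best_estimator_ on all of X and y,
# so this call only repeats that fit.
final_model.fit(X, y)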


X_test = pd.read_csv('test_preprocessed.csv')


# Load the original test data
original_test_data = pd.read_csv('test.csv')

# Extract the "PassengerId" column
passenger_id_df = original_test_data[['PassengerId']]


y_pred = final_model.predict(X_test)

# Create a DataFrame for the predicted "Transported" values
predicted_df = pd.DataFrame({'Transported': y_pred})

# Concatenate the "PassengerId" and predicted "Transported" DataFrames
submission_df = pd.concat([passenger_id_df, predicted_df], axis=1)

# Save the concatenated DataFrame to a CSV file
submission_df.to_csv('submission_grad_boost.csv', index=False)

In [98]:
df=pd.read_csv('submission_grad_boost.csv')
df.head(50)

| | PassengerId | Transported |
|---|---|---|
| 0 | 0013_01 | True |
| 1 | 0018_01 | False |
| 2 | 0019_01 | True |
| 3 | 0021_01 | True |
| 4 | 0023_01 | True |
| 5 | 0027_01 | False |
| 6 | 0029_01 | True |
| 7 | 0032_01 | True |
| 8 | 0032_02 | True |
| 9 | 0033_01 | True |
