### Elevator Pitch
The Star Wars survey data from FiveThirtyEight was cleaned and formatted to build a machine learning model predicting whether a respondent earns more than $50,000. Key insights include that survey responses align with original visuals, and the final model achieves high accuracy, demonstrating the predictive potential of preference data.


#### Libraries

In [1]:
#| label: libraries
#| include: true

import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt
from lets_plot import *

LetsPlot.setup_html()

## Question|Task 1

__Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.__


In [2]:
#| label: project-data
#| code-summary: Load the data
#| include: false

df = pd.read_csv("StarWars.csv", encoding="ISO-8859-1")

In [3]:
#| include: false

df = df.iloc[1:] # eliminate first row
df.info()

df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 1 to 1186
Data columns (total 38 columns):
 #   Column                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                         --------------  -----  
 0   RespondentID                                                                                                                                   1186 non-null   float64
 1   Have you seen any of the 6 films in the Star Wars franchise?                                                                                   1186 non-null   object 
 2   Do you consider yourself to be a fan of the Star Wars film franchise?                                                                          836 non-null    object 
 3   Which of the following Star 

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expan

In [4]:
#| include: false

star_wars = df[pd.notnull(df['RespondentID'])]

star_wars.head(5)

Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
1,3292880000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3.0,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
2,3292880000.0,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
3,3292765000.0,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1.0,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
4,3292763000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
5,3292731000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central


In [5]:
#| include: false

print(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts())
print(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts())

Have you seen any of the 6 films in the Star Wars franchise?
Yes    936
No     250
Name: count, dtype: int64
Do you consider yourself to be a fan of the Star Wars film franchise?
Yes    552
No     284
Name: count, dtype: int64


In [6]:
#| label: mapping
#| code-summary: mapping
#| include: true

yes_no={
    'Yes': True,
    'No': False
}

yes_no_cols = ['Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?']

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)

star_wars['Do you consider yourself to be a fan of the Expanded Universe?æ'] = star_wars['Do you consider yourself to be a fan of the Expanded Universe?æ'].map(yes_no)

star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'].map(yes_no)

In [7]:
#| include: false

# after cleaning
print(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts())
print(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts())

Have you seen any of the 6 films in the Star Wars franchise?
True     936
False    250
Name: count, dtype: int64
Do you consider yourself to be a fan of the Star Wars film franchise?
True     552
False    284
Name: count, dtype: int64


In [8]:
#| label: cleaning
#| code-summary: cleaning
#| include: true

cols_seen = {
    'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
    'Unnamed: 4': 'seen_2',
    'Unnamed: 5': 'seen_3',
    'Unnamed: 6': 'seen_4',
    'Unnamed: 7': 'seen_5',
    'Unnamed: 8': 'seen_6'    
}

star_wars = star_wars.rename(columns=cols_seen)

In [9]:
#| include: false

star_wars.columns[3:9]

Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')

In [10]:
#| label: cleaning_1
#| code-summary: cleaning_1
#| include: true

seen_notseen = {
    
    'seen_notseen_1': {
        star_wars.iloc[0,3]: True,
        np.nan: False
    },

    'seen_notseen_2': {
        star_wars.iloc[0,4]: True,
        np.nan: False
    },

    'seen_notseen_3': {
        star_wars.iloc[0,5]: True,
        np.nan: False
    },
    
    'seen_notseen_4': {
        star_wars.iloc[0,6]: True,
        np.nan: False
    },
    
    'seen_notseen_5': {
        star_wars.iloc[0,7]: True,
        np.nan: False
    },

    'seen_notseen_6': {
        star_wars.iloc[0,8]: True,
        np.nan: False
    },
}


for movie in range(1,7):
    star_wars['seen_' + str(movie)] = star_wars['seen_' + str(movie)].map(seen_notseen['seen_notseen_' + str(movie)])

In [11]:
#| label: hot_encoding1
#| code-summary: hot_encoding1
#| include: true

star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

cols_rank = {
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1',
    'Unnamed: 10': 'ranking_2',
    'Unnamed: 11': 'ranking_3',
    'Unnamed: 12': 'ranking_4',
    'Unnamed: 13': 'ranking_5',
    'Unnamed: 14': 'ranking_6'    
}

star_wars = star_wars.rename(columns=cols_rank)

In [12]:
#| label: hot_encoding
#| code-summary: hot_encoding
#| include: true

male_female={
    'Male': 1,
    'female': 0
}

ages={
    '18-29': 1,
    '30-44': 2,
    '45-60': 3,
    '> 60': 4
}

income = {
    '$0 - $24,999': (24999),
    '$25,000 - $49,999': (49999),
    '$50,000 - $99,999': (99999),
    '$100,000 - $149,999': (149999),
    '$150,000+': (200000)  # Upper limit for simulation
}


education={
    'Less than high school degree  ': 1,
    'High school degree ': 2,
    'Some college or Associate degree': 3,
    'Bachelor degree': 4,
    'Graduate degree': 5
}


star_wars['Gender'] = star_wars['Gender'].map(male_female)
star_wars['Age'] = star_wars['Age'].map(ages)
star_wars['Household Income'] = star_wars['Household Income'].map(income)  # Random income
star_wars['Education'] = star_wars['Education'].map(education)

In [13]:
#| include: false

# displayed max columns
pd.set_option('display.max_columns', None)

#unique values
star_wars['Household Income'].value_counts()

Household Income
99999.0     298
49999.0     186
149999.0    141
24999.0     138
200000.0     95
Name: count, dtype: int64

In [14]:
#| include: false

star_wars_drop = star_wars.drop(star_wars.columns[15:31], axis=1)

star_wars_drop.head()

Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,ranking_1,ranking_2,ranking_3,ranking_4,ranking_5,ranking_6,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
1,3292880000.0,True,True,True,True,True,True,True,True,3.0,2.0,1.0,4.0,5.0,6.0,False,False,1.0,1.0,,,South Atlantic
2,3292880000.0,False,,False,False,False,False,False,False,,,,,,,,True,1.0,1.0,24999.0,4.0,West South Central
3,3292765000.0,True,False,True,True,True,False,False,False,1.0,2.0,3.0,4.0,5.0,6.0,,False,1.0,1.0,24999.0,,West North Central
4,3292763000.0,True,True,True,True,True,True,True,True,5.0,6.0,1.0,2.0,4.0,3.0,,True,1.0,1.0,149999.0,3.0,West North Central
5,3292731000.0,True,True,True,True,True,True,True,True,5.0,4.0,6.0,2.0,1.0,3.0,False,False,1.0,1.0,149999.0,3.0,West North Central


In [15]:
#| include: false
#show unique values in the column gener and count

counts_gender = star_wars_drop['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
print(counts_gender)

Have you seen any of the 6 films in the Star Wars franchise?
True     936
False    250
Name: count, dtype: int64


In [16]:
#| label: name_clean
#| code-summary: name_clean
#| include: true

star_wars_names = star_wars_drop.rename(
    columns=
        {
        'Have you seen any of the 6 films in the Star Wars franchise?': 'Seen_any_film',
        'Do you consider yourself to be a fan of the Star Wars film franchise?':'Are_you_fan',
        'Do you consider yourself to be a fan of the Expanded Universe?æ':'fan_expanded_universe',
        'Do you consider yourself to be a fan of the Star Trek franchise?':'fan_star_trek',
        'Household Income':'Household_Income',
        'Location (Census Region)':'location'
        }
    )

star_wars_names.head(5)

Unnamed: 0,RespondentID,Seen_any_film,Are_you_fan,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,ranking_1,ranking_2,ranking_3,ranking_4,ranking_5,ranking_6,fan_expanded_universe,fan_star_trek,Gender,Age,Household_Income,Education,location
1,3292880000.0,True,True,True,True,True,True,True,True,3.0,2.0,1.0,4.0,5.0,6.0,False,False,1.0,1.0,,,South Atlantic
2,3292880000.0,False,,False,False,False,False,False,False,,,,,,,,True,1.0,1.0,24999.0,4.0,West South Central
3,3292765000.0,True,False,True,True,True,False,False,False,1.0,2.0,3.0,4.0,5.0,6.0,,False,1.0,1.0,24999.0,,West North Central
4,3292763000.0,True,True,True,True,True,True,True,True,5.0,6.0,1.0,2.0,4.0,3.0,,True,1.0,1.0,149999.0,3.0,West North Central
5,3292731000.0,True,True,True,True,True,True,True,True,5.0,4.0,6.0,2.0,1.0,3.0,False,False,1.0,1.0,149999.0,3.0,West North Central


## Question|Task 2

__Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.__

- Filter the dataset to respondents that have seen at least one film.
- Create a new column that converts the age ranges to a single number. Drop the age range categorical column.
- Create a new column that converts the education groupings to a single number. Drop the school categorical column.
- Create a new column that converts the income ranges to a single number. Drop the income range categorical column.
- Create your target (also known as “y” or “label”) column based on the new income range column.
- One-hot encode all remaining categorical columns.


In [17]:
#| label: seen_one_film
#| code-summary: seen_one_film
#| include: true

#Filter the dataset to respondents that have seen at least one film
star_wars_names['seen_any_real'] = star_wars_names[['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6']].any(axis=1)
filtered_df = star_wars_names[star_wars_names['seen_any_real'] == True]
filtered_df.head(5)

Unnamed: 0,RespondentID,Seen_any_film,Are_you_fan,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,ranking_1,ranking_2,ranking_3,ranking_4,ranking_5,ranking_6,fan_expanded_universe,fan_star_trek,Gender,Age,Household_Income,Education,location,seen_any_real
1,3292880000.0,True,True,True,True,True,True,True,True,3.0,2.0,1.0,4.0,5.0,6.0,False,False,1.0,1.0,,,South Atlantic,True
3,3292765000.0,True,False,True,True,True,False,False,False,1.0,2.0,3.0,4.0,5.0,6.0,,False,1.0,1.0,24999.0,,West North Central,True
4,3292763000.0,True,True,True,True,True,True,True,True,5.0,6.0,1.0,2.0,4.0,3.0,,True,1.0,1.0,149999.0,3.0,West North Central,True
5,3292731000.0,True,True,True,True,True,True,True,True,5.0,4.0,6.0,2.0,1.0,3.0,False,False,1.0,1.0,149999.0,3.0,West North Central,True
6,3292719000.0,True,True,True,True,True,True,True,True,1.0,4.0,3.0,6.0,5.0,2.0,False,True,1.0,1.0,49999.0,4.0,Middle Atlantic,True


Question 2, 3, and 4 were completed in previous codes labels : `cleaning_1`, `hot_encoding1`, `hot_encoding`


In [18]:
#| label: target_y
#| code-summary: target_y
#| include: true

filtered_df = filtered_df.rename(columns={'Household_Income': 'y'})  # Rename the column
filtered_df['y_target'] = filtered_df['y']  # Assign the renamed column to 'y_target'
filtered_df.head(5)

Unnamed: 0,RespondentID,Seen_any_film,Are_you_fan,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,ranking_1,ranking_2,ranking_3,ranking_4,ranking_5,ranking_6,fan_expanded_universe,fan_star_trek,Gender,Age,y,Education,location,seen_any_real,y_target
1,3292880000.0,True,True,True,True,True,True,True,True,3.0,2.0,1.0,4.0,5.0,6.0,False,False,1.0,1.0,,,South Atlantic,True,
3,3292765000.0,True,False,True,True,True,False,False,False,1.0,2.0,3.0,4.0,5.0,6.0,,False,1.0,1.0,24999.0,,West North Central,True,24999.0
4,3292763000.0,True,True,True,True,True,True,True,True,5.0,6.0,1.0,2.0,4.0,3.0,,True,1.0,1.0,149999.0,3.0,West North Central,True,149999.0
5,3292731000.0,True,True,True,True,True,True,True,True,5.0,4.0,6.0,2.0,1.0,3.0,False,False,1.0,1.0,149999.0,3.0,West North Central,True,149999.0
6,3292719000.0,True,True,True,True,True,True,True,True,1.0,4.0,3.0,6.0,5.0,2.0,False,True,1.0,1.0,49999.0,4.0,Middle Atlantic,True,49999.0


## Question|Task 3

__Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.__


In [19]:
#| label: question_3
#| code-summary: question_3
#| include: true

# wider seen
df_melted_q1 = filtered_df.melt(
    id_vars=['RespondentID'],
    value_vars=['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'],
    var_name='movies',
    value_name= 'test'
)

In [20]:
#| label: question_3-1
#| code-summary: question_3-1
#| include: true

grouped_counts = df_melted_q1[df_melted_q1['test'] == True].groupby('movies')['RespondentID'].count().reset_index()
grouped_counts.columns = ['movies', 'count']

In [21]:
#| label: question_3-2
#| code-summary: question_3-2
#| include: true

total_count = grouped_counts['count'].sum()
grouped_counts['percentage'] = ((grouped_counts['count'] / 835) * 100).round(0)

In [22]:
#| label: question_graph
#| code-summary: question_graph
#| include: false

from plotnine import ggplot, aes, geom_bar, labs, theme_minimal

name_movies={
    'seen_1': 'The Phantom Menace',
    'seen_2': 'Attack of the Clones',
    'seen_3': 'Revenge of the Sith',
    'seen_4': 'A New Hope',
    'seen_5': 'The Empire Strikes Back',
    'seen_6': 'Return of the Jedi'
}

grouped_counts['movies'] = grouped_counts['movies'].map(name_movies)

In [23]:
#| label: question_graph2
#| code-summary: question_graph2
#| include: true

from plotnine import ggplot, aes, geom_bar, labs, geom_text, theme, element_text

plot = (
    ggplot(grouped_counts, aes(x='movies', y='percentage')) +
    geom_bar(stat='identity', fill='darkblue') +
    geom_text(aes(label='percentage'), va='bottom', ha='center', color='Black', size=10) +  # Adding percentage labels
    labs(
        title='Unique Movies Seen by Respondents',
        x='Movies',
        y='Percentage of Respondents'
    ) +
    theme(
        axis_text_x=element_text(rotation=90, hjust=1),  # Rotate x-axis labels
        plot_title=element_text(size=16, face='bold'),
        plot_subtitle=element_text(size=12)
    )
)

print(plot)
plot.save('plot2.png') 

<ggplot: (672 x 480)>




![Picture_1](plot2.png)


In [24]:
#| label: ques_graph2
#| code-summary: que_graph-2
#| include: true

star_wars_names['seen_all_true'] = star_wars_names[['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6']].all(axis=1)

filtered_df = star_wars_names[star_wars_names['seen_all_true'] == True]

In [25]:
#| label: q2
#| code-summary: q2
#| include: true

import pandas as pd

# Melt seen columns
df_meltedq = filtered_df.melt(
    id_vars=['RespondentID'],
    value_vars=['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'],
    var_name='movies',
    value_name='test'

)
# Melt ranking columns
df_meltedq1 = filtered_df.melt(
    id_vars=['RespondentID'],
    value_vars=['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'],
    var_name='movies',
    value_name='ranking'
)

# Extract the numeric part of the 'movies' column
df_meltedq['movies'] = df_meltedq['movies'].str.extract('(\d+)', expand=False)
df_meltedq1['movies'] = df_meltedq1['movies'].str.extract('(\d+)', expand=False)

# Merge on RespondentID and movies
result = pd.merge(df_meltedq, df_meltedq1, on=['RespondentID', 'movies'])

In [26]:
#| include: true

filtered = result[result['ranking'] == 5]

In [27]:
#| include: true

from plotnine import ggplot, aes, geom_bar, labs, theme_minimal

name_movies={
    '6': 'The Panthon Menace',
    '5': 'Attack of the Clones',
    '4': 'Revenge of the Sith',
    '3': 'A New Hope',
    '2': 'The Empire Strikes Back',
    '1': 'Return of the Jedi'
}

filtered['movies'] = filtered['movies'].map(name_movies)

grouped_counts = filtered.groupby('movies')['ranking'].count().reset_index()
grouped_counts.columns = ['movies', 'count']

total_count = grouped_counts['count'].sum()
grouped_counts['percentage'] = ((grouped_counts['count'] / 471) * 100).round(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [28]:
#| label: question_graph3
#| code-summary: question_graph3
#| include: true

from plotnine import ggplot, aes, geom_bar, labs, geom_text, theme, element_text

plot = (
    ggplot(grouped_counts, aes(x='movies', y='percentage')) +
    geom_bar(stat='identity', fill='darkblue') +
    geom_text(aes(label='percentage'), va='bottom', ha='center', color='Black', size=10) +  # Adding percentage labels
    labs(
        title='What is the best star ward movies ',
        subtitle='Of 471 respondents who have seen all 6 movies',
        x='Movies',
        y='Percentage of Respondents'
    ) +
    theme(
        axis_text_x=element_text(rotation=90, hjust=1),  # Rotate x-axis labels
        plot_title=element_text(size=16, face='bold'),
        plot_subtitle=element_text(size=12)
    )
)

print(plot)
plot.save('plot1.png') 



<ggplot: (672 x 480)>


![Picture_1](plot1.png)

## Stretch Question|Task

__Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.__

I trained a Random Forest Classifier model to predict whether a respondent earns more than $50,000 based on survey data. The model's accuracy is 80.62%.


In [29]:
#| label: training
#| code-summary: training
#| include: true

star_wars_names['ml_prep'] = star_wars_names['Household_Income'].apply(lambda x: '1' if x > 50000 else '0')

star_wars_names.head(5)

Unnamed: 0,RespondentID,Seen_any_film,Are_you_fan,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,ranking_1,ranking_2,ranking_3,ranking_4,ranking_5,ranking_6,fan_expanded_universe,fan_star_trek,Gender,Age,Household_Income,Education,location,seen_any_real,seen_all_true,ml_prep
1,3292880000.0,True,True,True,True,True,True,True,True,3.0,2.0,1.0,4.0,5.0,6.0,False,False,1.0,1.0,,,South Atlantic,True,True,0
2,3292880000.0,False,,False,False,False,False,False,False,,,,,,,,True,1.0,1.0,24999.0,4.0,West South Central,False,False,0
3,3292765000.0,True,False,True,True,True,False,False,False,1.0,2.0,3.0,4.0,5.0,6.0,,False,1.0,1.0,24999.0,,West North Central,True,False,0
4,3292763000.0,True,True,True,True,True,True,True,True,5.0,6.0,1.0,2.0,4.0,3.0,,True,1.0,1.0,149999.0,3.0,West North Central,True,True,1
5,3292731000.0,True,True,True,True,True,True,True,True,5.0,4.0,6.0,2.0,1.0,3.0,False,False,1.0,1.0,149999.0,3.0,West North Central,True,True,1


In [30]:
#| label: replace
#| code-summary: replace
#| include: true

for column in star_wars_names.columns:
    most_common_value = star_wars_names[column].mode()[0]  # Get the mode (most frequent value) of the column
    star_wars_names[column].fillna(most_common_value, inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.




In [31]:
#| include: false

print(star_wars_names.isna().sum())

print((star_wars_names.isna().mean() * 100).round(2))

RespondentID             0
Seen_any_film            0
Are_you_fan              0
seen_1                   0
seen_2                   0
seen_3                   0
seen_4                   0
seen_5                   0
seen_6                   0
ranking_1                0
ranking_2                0
ranking_3                0
ranking_4                0
ranking_5                0
ranking_6                0
fan_expanded_universe    0
fan_star_trek            0
Gender                   0
Age                      0
Household_Income         0
Education                0
location                 0
seen_any_real            0
seen_all_true            0
ml_prep                  0
dtype: int64
RespondentID             0.0
Seen_any_film            0.0
Are_you_fan              0.0
seen_1                   0.0
seen_2                   0.0
seen_3                   0.0
seen_4                   0.0
seen_5                   0.0
seen_6                   0.0
ranking_1                0.0
ranking_2             

In [32]:
#| label: building_model
#| code-summary: building_model
#| include: true

X = star_wars_names[['RespondentID', 'Seen_any_film', 'seen_1', 'seen_2',
       'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1', 'ranking_2',
       'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
        'Gender', 'Age',
       'Household_Income', 'Education']]

y=star_wars_names['ml_prep']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, random_state=1)

In [33]:
#| label: Functions
#| code-summary: function
#| include: true

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    elif train==False:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

In [34]:
#| label: running_model1
#| code-summary: running_model1
#| include: true

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print_score(clf, X_train, y_train, X_test, y_test, train=True)
print_score(clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
               0      1  accuracy  macro avg  weighted avg
precision    1.0    1.0       1.0        1.0           1.0
recall       1.0    1.0       1.0        1.0           1.0
f1-score     1.0    1.0       1.0        1.0           1.0
support    459.0  371.0       1.0      830.0         830.0
_______________________________________________
Confusion Matrix: 
 [[459   0]
 [  0 371]]

Test Result:
Accuracy Score: 80.62%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.829787    0.779762   0.80618    0.804775      0.806882
recall       0.808290    0.803681   0.80618    0.805986      0.806180
f1-score     0.818898    0.791541   0.80618    0.805219      0.806372
support    193.000000  163.000000   0.80618  356.000000    356.000000
_______________________________________________