### Walkthrough

### Background

Survey data is notoriously difficult to munge. Even when the data is recorded cleanly, the mix of ‘write in questions’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice questions’ makes storing the data in a tidy format difficult.

In 2014, FiveThirtyEight surveyed over 1,000 people to write the article [America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters)](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/). They have provided the data on [GitHub](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey).

For this project, your client would like to use the Star Wars survey data to figure out whether they can predict a job candidate’s current income from a few interview responses about Star Wars movies.

### Client Request

The client performed the survey but outsourced the analytics to a third party. They want you to clean up the data so you can:

a. Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article

b. Determine whether you can predict if a person from the survey makes more than $50k

### Data

**Download:** [StarWars.csv](https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv)  
**Information:** [Article](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/)

### Readings

-   [Python for Data Science: Tidy Data](https://byuidatascience.github.io/python4ds/tidy-data.html)
-   [Python for Data Science: Graphics for Communication](https://byuidatascience.github.io/python4ds/graphics-for-communication.html)
-   [Python for Data Science: Strings](https://byuidatascience.github.io/python4ds/strings.html)

#### Optional References

-   [Why to not use get\_dummies](https://digestize.medium.com/why-is-using-get-dummies-a-bad-idea-for-your-ml-project-bcfd2683d2e4)

### Questions and Tasks (Core)

1.  **Shorten the column names and clean them up for easier use with pandas.** Provide a table or list that exemplifies how you fixed the names.
    
2.  **Clean and format the data so that it can be used in a machine learning model.** As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
    
    1.  Filter the dataset to respondents that have seen at least one film  
    2.  Create a new column that converts the age ranges to a single number. Drop the age range categorical column  
    3.  Create a new column that converts the education groupings to a single number. Drop the school categorical column  
    4.  Create a new column that converts the income ranges to a single number. Drop the income range categorical column  
    5.  Create your target (also known as “y” or “label”) column based on the new income range column  
    6.  One-hot encode all remaining categorical columns
   
3.  **Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.**
4.  **Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.**
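
The cleaning steps in task 2 can be sketched on a small toy table. This is a minimal illustration under stated assumptions, not the required solution: the short column names (`seen_any`, `age`, `education`, `income`, `gender`), the helper `lower_bound`, the `education_years` mapping, and every category label other than the ones visible in the survey file are hypothetical. It uses the lower bound of each range as the "single number" and `pd.get_dummies` for one-hot encoding; the optional reference above discusses when a fitted encoder may be preferable.

```python
import re

import pandas as pd

# Toy rows standing in for the survey; the short column names are hypothetical.
toy = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes", "Yes"],
    "age": ["18-29", "18-29", "30-44", "> 60"],
    "education": ["High school degree", "Bachelor degree",
                  "Bachelor degree", "High school degree"],
    "income": ["$0 - $24,999", "$0 - $24,999",
               "$50,000 - $99,999", "$150,000+"],
    "gender": ["Male", "Male", "Female", "Female"],
})


def lower_bound(label):
    """Return the first number in a range label, e.g. '$0 - $24,999' -> 0.0."""
    match = re.search(r"\d+", str(label).replace(",", ""))
    return float(match.group()) if match else float("nan")


# Assumed mapping: approximate years of schooling per education label.
education_years = {"High school degree": 12, "Bachelor degree": 16}

# 1. Keep respondents who have seen at least one film.
seen = toy[toy["seen_any"] == "Yes"].copy()

# 2-4. Convert each range or grouping to one number, then drop the originals.
#      seen_any is constant after the filter, so it is dropped as well.
seen["age_num"] = seen["age"].map(lower_bound)
seen["education_num"] = seen["education"].map(education_years)
seen["income_num"] = seen["income"].map(lower_bound)
seen = seen.drop(columns=["seen_any", "age", "education", "income"])

# 5. Target: 1 when the income bracket starts at $50,000 or higher.
seen["over_50k"] = (seen["income_num"] >= 50_000).astype(int)

# 6. One-hot encode the remaining categorical column.
encoded = pd.get_dummies(seen, columns=["gender"], dtype=int)
print(encoded)
```

Two caveats apply to this sketch. Rows with a missing income would get a target of 0 here, so in the real data they likely need to be dropped or handled before the target is built. Because `income_num` defines the target, it should be left out of the feature set when training a model.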
    

### Questions and Tasks (Stretch)

Here are example Stretch questions for this project. Your instructor may assign different Stretch questions. You must comment in Canvas when submitting your project if you completed any of the Stretch questions.

1.  **Build a machine learning model that predicts whether a person makes more than $50k with an accuracy of at least 65%. Describe your model and report the accuracy.**
2.  **Validate that the data provided on GitHub lines up with the article by recreating a 3rd visual from the article.**
3.  **Create a new column that converts the location groupings to a single number. Drop the location categorical column.**
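
For the modeling tasks, one possible workflow is a train/test split followed by a classifier and an accuracy score. The sketch below assumes scikit-learn is installed and uses a random forest purely as an example; the project does not prescribe a model. The feature names and the data are synthetic placeholders, not survey results, so the printed accuracy says nothing about what is achievable on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned survey features (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "age_num": rng.choice([18, 30, 45, 60], size=n),
    "education_num": rng.choice([12, 14, 16, 18], size=n),
    "gender_Male": rng.integers(0, 2, size=n),
})

# Synthetic label loosely tied to education so the model has some signal.
noise = rng.normal(0, 2, size=n)
target = ((features["education_num"] + noise) > 15).astype(int)

# Hold out 30% of the rows so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42, stratify=target
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}")
```

On the real data, the income column used to build the target should be excluded from `features`; otherwise the model would simply read the answer from its input.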
    

### Deliverables

_Use this [template](https://byuistats.github.io/DS250-Course/Templates/ds250_project_template_clean.qmd) to submit your Client Report. The template has two sections:_

1.  _A short elevator pitch that highlights key values or metrics from the results. Describe these key insights in a way that hooks the reader and makes them want to read more about your work. The writing style should be more technical with some creative elements. Do not summarize what you did._  
2.  _Answers to the questions and tasks. Each answer should include a written description of your results, code cells with comments, charts, and/or tables._

_Your report should be written in Quarto markdown files and pushed to GitHub, which will render it to HTML. Submit the URL of the rendered project in Canvas. (Do not submit the URL of the GitHub `.qmd` file.)_

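```python
# Standard imports
import numpy as np
import pandas as pd
import plotly.express as plt

# Location of the survey data on GitHub; the file is read from a local path below
url = "https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv"

# header=None keeps the file's two header rows as data rows 0 and 1 so they can be inspected
df = pd.read_csv("./Data/StarWars.csv", encoding="ISO-8859-1", header=None)

df.head()
```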

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | RespondentID | Have you seen any of the 6 films in the Star W... | Do you consider yourself to be a fan of the St... | Which of the following Star Wars films have yo... |  |  |  |  |  | Please rank the Star Wars films in order of pr... | ... |  | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Ex... | Do you consider yourself to be a fan of the St... | Gender | Age | Household Income | Education | Location (Census Region) |
| 1 |  | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
| 2 | 3292879998 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 |  | High school degree | South Atlantic |
| 3 | 3292879538 | No |  |  |  |  |  |  |  |  | ... |  |  |  |  | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
| 4 | 3292765271 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith |  |  |  | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No |  | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
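
Because the file was read with `header=None`, rows 0 and 1 hold the question text and the per-column sub-headers. One way to turn them into short column names is to forward-fill the question row, combine it with the sub-header row, and then drop both rows from the data. The sketch below works on a small toy frame that mimics this layout; the `question_short` mapping and the `shorten` helper are hypothetical names chosen for illustration, and the question strings are kept truncated exactly as they appear in the output above.

```python
import re

import pandas as pd

# Toy frame mimicking the layout above: row 0 = questions, row 1 = sub-headers.
raw = pd.DataFrame([
    ["RespondentID",
     "Have you seen any of the 6 films in the Star W...",
     "Which of the following Star Wars films have yo...",
     None,
     "Gender"],
    [None,
     "Response",
     "Star Wars: Episode I The Phantom Menace",
     "Star Wars: Episode II Attack of the Clones",
     "Response"],
    ["3292879998",
     "Yes",
     "Star Wars: Episode I The Phantom Menace",
     "Star Wars: Episode II Attack of the Clones",
     "Male"],
])

# Carry each question across the blank columns that belong to it.
questions = raw.iloc[0].ffill()
subheads = raw.iloc[1].fillna("")

# Hypothetical short names, keyed by the start of each question.
question_short = {
    "RespondentID": "id",
    "Have you seen any": "seen_any",
    "Which of the following": "seen",
    "Gender": "gender",
}


def shorten(question, subhead):
    """Build one short column name from a question and its sub-header."""
    prefix = next(
        (short for key, short in question_short.items() if question.startswith(key)),
        question,
    )
    if subhead in ("", "Response"):
        return prefix
    # Film sub-headers reduce to the episode numeral, e.g. 'Episode II' -> 'ii'.
    match = re.search(r"Episode (\w+)", subhead)
    suffix = match.group(1).lower() if match else subhead.lower().replace(" ", "_")
    return f"{prefix}_{suffix}"


names = [shorten(q, s) for q, s in zip(questions, subheads)]

# Drop the two header rows and attach the new names.
clean = raw.iloc[2:].reset_index(drop=True)
clean.columns = names
print(names)
```

The same pattern could be applied to `df` from the cell above once `question_short` covers every question in the survey; a table of old and new names built from `questions`, `subheads`, and `names` would document the renaming.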
