# DTSC-580: Data Manipulation
## Assignment: Survivor

### Name:

## Overview

In this assignment, you will test all the skills that you have learned during this course to manipulate the provided data to find the answers to questions about the TV show Survivor.  If you are not familiar with this show, start by watching this short clip that briefly explains it:  [Survivor Explained](https://www.youtube.com/watch?v=l1-hTpG_krk)

Please note that your notebook should be named `survivor` when submitting to CodeGrade for the automatic grading to work properly.

## Data

The Survivor data is an R package from Daniel Oehm.  Daniel has made the data for this package available as an Excel file, as explained in his article on [gradientdescending.com](http://gradientdescending.com/survivor-data-from-the-tv-series-in-r/).  Please use the file from our Brightspace page, though, so that your data matches what CodeGrade is expecting.  We have also corrected some errors in the file, which is another reason that you must use the data given to you.

You need to first read the article on the website linked above.  This will give you additional details about the data that will be important as you answer the questions below.

Please note that there is a data dictionary in the file that explains the columns in the data.  You will also want to become familiar with the various spreadsheets and column names.

Finally, here are a couple of things to know for those of you that have not seen the show:
- Survivor is a reality TV show that first aired May 31, 2000 and is currently still on TV.
- Contestants are broken up into two teams (usually) where they live in separate camps. 
- The teams compete in various challenges for rewards (food, supplies, brief experience trips, etc.) and tribal immunity.
- The team that loses a challenge, and therefore doesn't get the tribal immunity, goes to tribal council where they have to vote one of their members out (this data is represented in the "Vote History" spreadsheet).
- After there are a small number of contestants left, the tribes are merged into one tribe where each contestant competes for individual immunity.  The winner of the individual immunity cannot get voted out and is safe at the next tribal council.
- There are also hidden immunity idols that are hidden around the campground.  If a contestant finds a hidden immunity idol and plays it at the tribal council, then all votes against them do not count, and the player with the next highest number of votes goes home.
- When the contestants get down to 2 or 3 people, a number of the last contestants, known as the jury, come back to vote for the person who they think should win the game.  The winner is the one who gets the most jury votes (this data is represented in the "Jury Votes" spreadsheet).  This person is known as the Sole Survivor.
- Voting recap: 
    - Tribal Council votes (Vote History spreadsheet) are bad; contestants with the most votes get sent home
    - Jury Votes (Jury Votes spreadsheet) are good; the contestant with the most votes wins the game and is the Sole Survivor

## Note

<u>Show Work</u>

Remember that you must show your work.  Student submissions are spot checked manually to verify that they are not hard coding the answer after looking it up in the file or in CodeGrade's expected output.  If this is seen, the student's answer will be manually marked wrong and their grade will be changed to reflect this.

For example, suppose the question is: which contestant has received the most tribal council votes to be voted out?  Select their record from the `castaway_details` DataFrame.

You would show your work and code similar to this:
```
### incorrect way ###
Q1 = castaway_details[castaway_details['castaway_id'] ==  333]

### correct way - showing your work ###
# get index
idx = vote_history.groupby('vote_id').size().sort_values(ascending=False).index[0]

# select row based on index 
Q1 = castaway_details[castaway_details['castaway_id'] ==  idx]
```

<u>Use Copy</u>

Don't change any of the original DataFrames unless specifically asked or CodeGrade will not work correctly for this assignment.  Make sure you use `copy()` if needed.
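
As a minimal sketch of why `copy()` matters, the toy frame below (hypothetical data, not the assignment file) is modified through a copy so that the original keeps its columns:

```python
import pandas as pd

# Hypothetical toy frame standing in for one of the assignment's DataFrames
original = pd.DataFrame({"castaway_id": [1, 2, 3], "day": [3, 6, 9]})

# Work on an independent copy so the original stays untouched
working = original.copy()
working["day_doubled"] = working["day"] * 2

print(list(original.columns))  # ['castaway_id', 'day']
print(list(working.columns))   # ['castaway_id', 'day', 'day_doubled']
```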

In [1]:
# standard imports
import pandas as pd
import numpy as np

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', None)

First, import the data from the `survivor.xlsx` file, calling the respective DataFrames the same as the sheet name but with lowercase and [snake case](https://en.wikipedia.org/wiki/Snake_case).  For example, the sheet called `Castaway Details` should be saved as a DataFrame called `castaway_details`.  Make sure that the data files are in the same folder as your notebook.

Note:  You may or may not need to install [openpyxl](https://openpyxl.readthedocs.io/en/stable/) for the code below to work.  You can use: `$ pip install openpyxl`
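
The naming rule above (lowercase plus snake case) can be sketched with a small helper. `to_snake_case` is a hypothetical name used only for illustration; it is not part of pandas or of the assignment.

```python
def to_snake_case(name: str) -> str:
    """Lowercase a label and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


# A few of the sheet names used in this assignment
sheet_names = ["Castaway Details", "Vote History", "Tribe Colours"]
snake_names = [to_snake_case(s) for s in sheet_names]
print(snake_names)
# ['castaway_details', 'vote_history', 'tribe_colours']
```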

In [2]:
# import data from Excel

# setup Filename and Object
fileName = "survivor.xlsx"
xls = pd.ExcelFile(fileName)

# import individual sheets
castaway_details = pd.read_excel(xls, 'Castaway Details')
castaways = pd.read_excel(xls, 'Castaways')
challenge_description = pd.read_excel(xls, 'Challenge Description')
challenge_results = pd.read_excel(xls, 'Challenge Results')
confessionals = pd.read_excel(xls, 'Confessionals')
hidden_idols = pd.read_excel(xls, 'Hidden Idols')
jury_votes = pd.read_excel(xls, 'Jury Votes')
tribe_mapping = pd.read_excel(xls, 'Tribe Mapping')
viewers = pd.read_excel(xls, 'Viewers')
vote_history = pd.read_excel(xls, 'Vote History')
season_summary = pd.read_excel(xls, 'Season Summary')
season_palettes = pd.read_excel(xls, 'Season Palettes')
tribe_colours = pd.read_excel(xls, 'Tribe Colours')

**Exercise 1:** Change every column name of every DataFrame to lowercase and snake case.  This is a standard first step for some programmers, as lowercase names are easier to type and snake case makes multiple-word column names easier to select and copy.

For example, `Castaway Id` should end up being `castaway_id`.  You should try doing this using a `for` loop instead of manually changing the names for each column.  It should take you no more than a few lines of code.  Use Stack Overflow if you need help.

In [3]:
castaway_details.head()

Unnamed: 0,Castaway Id,Full Name,Short Name,Date of Birth,Date of Death,Gender,Race,Ethnicity,Occupation,Personality Type
0,1,Sonja Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP
1,2,B.B. Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ
2,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ
3,4,Ramona Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ
4,5,Dirk Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,ISFP


In [4]:
df_list = [castaway_details, castaways, challenge_description, challenge_results, confessionals, hidden_idols, jury_votes, tribe_mapping, viewers, vote_history, season_summary, season_palettes, tribe_colours ]

# rename the columns of each DataFrame in place: lowercase, spaces replaced by underscores
for df in df_list:
    df.columns = df.columns.str.lower().str.replace(' ', '_')

In [5]:
castaway_details.columns

Index(['castaway_id', 'full_name', 'short_name', 'date_of_birth',
       'date_of_death', 'gender', 'race', 'ethnicity', 'occupation',
       'personality_type'],
      dtype='object')

In [6]:
castaways.columns

Index(['season_name', 'season', 'full_name', 'castaway_id', 'castaway', 'age',
       'city', 'state', 'personality_type', 'episode', 'day', 'order',
       'result', 'jury_status', 'original_tribe', 'swapped_tribe',
       'swapped_tribe_2', 'merged_tribe', 'total_votes_received',
       'immunity_idols_won'],
      dtype='object')

**Q2:** What contestant was the oldest at the time of their season?  We want to look at their age at the time of the season and NOT their current age.  Select their row from the `castaway_details` DataFrame and save this as `Q2`.  This should return a DataFrame and the index and missing values should be left as is.
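
The distinction between "born earliest" and "oldest at the time of the season" can be illustrated on toy data. The frames and values below are hypothetical and only sketch the idea; they are not taken from the assignment file.

```python
import pandas as pd

# Toy data (hypothetical): one row per contestant per season
toy_castaways = pd.DataFrame({
    "castaway_id": [1, 2, 2],
    "season": [1, 1, 8],
    "age": [62, 60, 68],
})
toy_details = pd.DataFrame({
    "castaway_id": [1, 2],
    "date_of_birth": pd.to_datetime(["1938-01-01", "1940-01-01"]),
})

# The earliest birth date picks contestant 1 ...
earliest_born = toy_details.loc[toy_details["date_of_birth"].idxmin(), "castaway_id"]
# ... but the highest age during a season belongs to contestant 2
oldest_on_show = toy_castaways.loc[toy_castaways["age"].idxmax(), "castaway_id"]
print(earliest_born, oldest_on_show)  # 1 2
```

Because returning players age between seasons, the `age` column is the safer source for this kind of question.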

In [7]:
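# age at the time of the season is recorded in `castaways`, not in the birth date
oldest_id = castaways.loc[castaways['age'].idxmax(), 'castaway_id']
Q2 = castaway_details[castaway_details['castaway_id'] == oldest_id]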
Q2

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
13,14,Rudy Boesch,Rudy,1928-01-20,2019-11-01,Male,,,Retired Navy SEAL,ISTJ


**Q3:** What contestant played in the most number of seasons? Select their row from the `castaway_details` DataFrame and save this as `Q3`.  This should return a DataFrame and the index and missing values should be left as is.
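
One way to count seasons per contestant, sketched on hypothetical toy data, is `groupby` with `nunique`, which ignores repeated rows for the same contestant within one season:

```python
import pandas as pd

# Toy data (hypothetical): contestant 55 has two rows in season 8
toy_castaways = pd.DataFrame({
    "castaway_id": [55, 55, 55, 55, 179, 179, 201],
    "season": [4, 8, 8, 20, 12, 16, 13],
})

# Distinct seasons per contestant, then the ID with the largest count
seasons_played = toy_castaways.groupby("castaway_id")["season"].nunique()
top_id = seasons_played.idxmax()
print(top_id)  # 55
```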

In [8]:
Q3_df = castaways.drop_duplicates(subset=['full_name', 'season'])
Q3_df

Unnamed: 0,season_name,season,full_name,castaway_id,castaway,age,city,state,personality_type,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
0,Survivor: 41,41,Erika Casupanan,594,Erika,32,Toronta,Ontario,ENFP,13,26,18,Sole Survivor,,Luvu,,,Via Kana,2,8
1,Survivor: 41,41,Deshawn Radden,601,Deshawn,26,Miami,Florida,ENTP,13,26,17,Runner-up,,Luvu,,,Via Kana,7,6
2,Survivor: 41,41,Xander Hastings,597,Xander,20,Chicago,Illinois,INFJ,13,26,16,2nd runner-up,,Yase,,,Via Kana,2,6
3,Survivor: 41,41,Heather Aldret,593,Heather,52,Charleston,South Carolina,ISFJ,13,25,15,15th voted out,8th jury member,Luvu,,,Via Kana,4,6
4,Survivor: 41,41,Ricard Foye,596,Ricard,31,Sedro-Woolley,Washington,ENTJ,13,24,14,14th voted out,7th jury member,Ua,,,Via Kana,9,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
757,Survivor: Borneo,1,Dirk Been,5,Dirk,23,Spring Green,Wisconsin,ISFP,5,15,5,5th voted out,,Tagi,,,,4,2
758,Survivor: Borneo,1,Ramona Gray,4,Ramona,29,Edison,New Jersey,ISTJ,4,12,4,4th voted out,,Pagong,,,,6,2
759,Survivor: Borneo,1,Stacey Stillman,3,Stacey,27,San Francisco,California,ENTJ,3,9,3,3rd voted out,,Tagi,,,,6,1
760,Survivor: Borneo,1,B.B. Andersen,2,B.B.,64,Mission Hills,Kansas,ESTJ,2,6,2,2nd voted out,,Pagong,,,,6,1


In [9]:
Q3_df.groupby('full_name')['season'].count().sort_values(ascending=False)

full_name
Rob Mariano       5
Cirie Fields      4
Oscar Lusth       4
Tyson Apostol     4
Rupert Boneham    4
                 ..
Holly Hoffman     1
Helen Glover      1
Heidi Strobel     1
Heather Aldret    1
Zoe Zanidakis     1
Name: season, Length: 611, dtype: int64

In [10]:
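# look up the contestant with the most seasons instead of typing the name
top_id = Q3_df.groupby('castaway_id')['season'].nunique().idxmax()
Q3 = castaway_details[castaway_details['castaway_id'] == top_id]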
Q3

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
54,55,Rob Mariano,Boston Rob,1975-12-25,NaT,Male,,,Construction Worker,ESTJ


**Q4:** Create a DataFrame of all the contestants that won their season (aka their final result in the `castaways` DataFrame was the 'Sole Survivor').  Call this DataFrame `sole_survivor`.  Note that contestants may appear more than one time in this DataFrame if they won more than one season.  Make sure that the index goes from 0 to n-1 and that the DataFrame is sorted ascending by season number.

The DataFrame should have the same columns, and the columns should be in the same order, as the `castaways` DataFrame.
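
The filter, sort, and re-index steps described above can be sketched on a toy frame. The data here is hypothetical; only the column names mirror `castaways`.

```python
import pandas as pd

# Toy data (hypothetical) with the same column names as `castaways`
toy_castaways = pd.DataFrame({
    "season": [2, 1, 2, 1],
    "full_name": ["C", "A", "D", "B"],
    "result": ["Sole Survivor", "Runner-up", "Runner-up", "Sole Survivor"],
})

# Keep winners, sort by season, and rebuild the index as 0..n-1
toy_winners = (
    toy_castaways[toy_castaways["result"] == "Sole Survivor"]
    .sort_values(by="season")
    .reset_index(drop=True)
)
print(toy_winners)
```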

In [11]:
sole_survivor = castaways[castaways['result'] == 'Sole Survivor'].sort_values(by='season').reset_index(drop=True)
sole_survivor

Unnamed: 0,season_name,season,full_name,castaway_id,castaway,age,city,state,personality_type,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
0,Survivor: Borneo,1,Richard Hatch,16,Richard,39,Newport,Rhode Island,ENTP,14,39,16,Sole Survivor,,Tagi,,,Rattana,6,4
1,Survivor: The Australian Outback,2,Tina Wesson,32,Tina,40,Knoxville,Tennessee,ESFJ,16,42,16,Sole Survivor,,Ogakor,,,Barramundi,0,2
2,Survivor: Africa,3,Ethan Zohn,48,Ethan,27,Lexington,Massachusetts,ISFP,15,39,16,Sole Survivor,,Boran,Boran,,Moto Maji,0,4
3,Survivor: Marquesas,4,Vecepia Towery,64,Vecepia,36,Hayward,California,ISTJ,15,39,16,Sole Survivor,,Maraamu,Rotu,,Soliantu,2,4
4,Survivor: Thailand,5,Brian Heidik,80,Brian,34,Quartz Hill,California,ISTP,15,39,16,Sole Survivor,,Chuay Gahn,,,Chuay Jai,0,8
5,Survivor: The Amazon,6,Jenna Morasca,96,Jenna,21,Bridgeville,Pennsylvania,ISTP,15,39,16,Sole Survivor,,Jaburu,Jaburu,,Jacaré,3,7
6,Survivor: Pearl Islands,7,Sandra Diaz-Twine,112,Sandra,29,Fort Lewis,Washington,ESTP,15,39,18,Sole Survivor,,Drake,Drake,,Balboa,0,3
7,Survivor: All-Stars,8,Amber Brkich,27,Amber,25,Beaver,Pennsylvania,ISFP,17,39,18,Sole Survivor,,Chapera,Chapera,Chapera,Chaboga Mogo,6,6
8,Survivor: Vanuatu,9,Chris Daugherty,130,Chris,33,South Vienna,Ohio,ENTP,15,39,18,Sole Survivor,,Lopevi,Lopevi,,Alinta,3,6
9,Survivor: Palau,10,Tom Westman,150,Tom,40,Sayville,New York,ESTJ,15,39,20,Sole Survivor,,Koror,,,Koror,0,12


**Q5:** Have any contestants won more than one time?  If so, select their records from the `sole_survivor` DataFrame, sorting the rows by season.  Save this as `Q5`.  If no contestant has won twice, save `Q5` as the string `None`.
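
A possible way to find repeat winners without typing any names is `duplicated(keep=False)`, which flags every row whose key occurs more than once. The sketch below uses hypothetical toy data and also covers the "no repeat winner" case from the prompt:

```python
import pandas as pd

# Toy winners table (hypothetical data)
toy_winners = pd.DataFrame({
    "season": [7, 20, 28, 30, 40],
    "castaway_id": [112, 112, 424, 500, 424],
})

# Flag every row whose castaway_id appears more than once
repeat_rows = toy_winners[toy_winners["castaway_id"].duplicated(keep=False)]
repeat_rows = repeat_rows.sort_values(by="season")

# Fall back to the string 'None' when nobody has won twice
answer = repeat_rows if not repeat_rows.empty else "None"
print(answer)
```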

In [12]:
sole_survivor.groupby('full_name')['season'].count().sort_values(ascending=False)[:2]

full_name
Tony Vlachos         2
Sandra Diaz-Twine    2
Name: season, dtype: int64

In [13]:
# select every winner whose castaway_id appears more than once, without typing names
Q5 = sole_survivor[sole_survivor['castaway_id'].duplicated(keep=False)].sort_values(by='season')
Q5

Unnamed: 0,season_name,season,full_name,castaway_id,castaway,age,city,state,personality_type,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
6,Survivor: Pearl Islands,7,Sandra Diaz-Twine,112,Sandra,29,Fort Lewis,Washington,ESTP,15,39,18,Sole Survivor,,Drake,Drake,,Balboa,0,3
19,Survivor: Heroes vs. Villains,20,Sandra Diaz-Twine,112,Sandra,35,Fayetteville,North Carolina,ESTP,15,39,20,Sole Survivor,,Villains,,,Yin Yang,3,4
27,Survivor: Cagayan,28,Tony Vlachos,424,Tony,39,Jersey City,New Jersey,ESTP,14,39,18,Sole Survivor,,Aparri,Solana,,Solarrion,5,6
39,Survivor: Winners at War,40,Tony Vlachos,424,Tony,45,Allendale,New Jersey,ESTP,15,39,22,Sole Survivor,,Dakal,Dakal,,Koru,0,9


**Q6:** Using [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html), what is the normalized relative frequency (percentage) breakdown of gender for all the contestants?  Count someone who played in multiple seasons only once.  Round the results to 3 decimal places.  Save this as `Q6`.
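
As a small illustration of `value_counts(normalize=True)`, the sketch below uses a hypothetical toy table with one row per contestant, so each person is counted once:

```python
import pandas as pd

# Toy details table (hypothetical): one row per contestant
toy_details = pd.DataFrame({
    "castaway_id": [1, 2, 3],
    "gender": ["Male", "Female", "Female"],
})

# Relative frequency of each value, rounded to 3 decimal places
gender_share = toy_details["gender"].value_counts(normalize=True).round(3)
print(gender_share)
```

With two of three rows marked `Female`, the shares come out to roughly 0.667 and 0.333.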

In [14]:
Q6 = round(castaway_details['gender'].value_counts(normalize=True),3)
Q6

Male          0.502
Female        0.497
Non-binary    0.002
Name: gender, dtype: float64

**Q7:** 

- What percentage of the time has a male won his season?  Save this percentage as `Q7A`.  
- What percentage of the time has a female won her season?  Save this percentage as `Q7B`.  
- Note: Round all percentages to two decimal places and write as a float (example: `55.57`).
- Note 2: If a contestant has won twice, count each win separately.

In [15]:
df_joined = pd.merge(castaway_details, castaways, on='castaway_id')

In [16]:
(df_joined[df_joined['result'] == 'Sole Survivor'].value_counts('gender', normalize=True))*100

gender
Male      60.97561
Female    39.02439
dtype: float64

In [17]:
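# compute the value from the joined data instead of typing it in
Q7A = round(float(df_joined.loc[df_joined['result'] == 'Sole Survivor', 'gender'].value_counts(normalize=True)['Male'] * 100), 2)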
Q7A

60.98

In [18]:
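# compute the value from the joined data instead of typing it in
Q7B = round(float(df_joined.loc[df_joined['result'] == 'Sole Survivor', 'gender'].value_counts(normalize=True)['Female'] * 100), 2)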
Q7B

39.02

**Q8:** What is the average age of contestants when they appeared on the show?  Save this as `Q8`.  Round to nearest integer.

In [19]:
Q8 = round(castaways.age.mean())
Q8

33

**Q9:** Who played the most total number of days of Survivor? If a contestant appeared on more than one season, you would add their total days for each season together. Save the top five contestants in terms of total days played as a DataFrame and call it `Q9`, sorted in descending order by total days played.  

The following columns should be included: `castaway_id`, `full_name`, and `total_days_played` where `total_days_played` is the sum of all days a contestant played. The index should go from 0 to n-1.

Note:  Be careful because on some seasons, the contestant was allowed to come back into the game after being voted off.  Take a look at [Season 23's contestant Oscar Lusth](https://en.wikipedia.org/wiki/Ozzy_Lusth#South_Pacific) in the `castaways` DataFrame as an example.  He was voted out 7th and then returned to the game.  He was then voted out 9th and returned to the game a second time.  He was then voted out 17th the final time.  Be aware of this in your calculations and make sure you are counting the days according to the last time they were voted off or won. 
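
The returning-player case can be sketched as follows. The toy rows are hypothetical, and the sketch assumes that the last exit in a season is also the row with the largest `day` value, so it takes the maximum day per contestant per season before summing:

```python
import pandas as pd

# Toy data (hypothetical): contestant 201 was voted out three times in season 23
toy_castaways = pd.DataFrame({
    "castaway_id": [201, 201, 201, 201, 55],
    "full_name": ["Player A", "Player A", "Player A", "Player A", "Player B"],
    "season": [13, 23, 23, 23, 4],
    "day": [39, 19, 27, 38, 21],
})

# Step 1: last day reached per contestant per season
last_day = toy_castaways.groupby(
    ["castaway_id", "full_name", "season"], as_index=False
)["day"].max()

# Step 2: total days across seasons, sorted descending with a fresh index
totals = (
    last_day.groupby(["castaway_id", "full_name"], as_index=False)["day"].sum()
    .rename(columns={"day": "total_days_played"})
    .sort_values(by="total_days_played", ascending=False)
    .reset_index(drop=True)
)
print(totals)
```

In this toy frame, Player A ends up with 39 + 38 = 77 days rather than the 123 that a plain sum over all rows would give.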

In [20]:
Q9_df = castaways.drop_duplicates(subset=['season', 'full_name'])

In [21]:
Q9 = Q9_df.groupby(['castaway_id', 'full_name'], as_index=False)['day'].sum().sort_values(by='day',ascending=False).reset_index(drop=True).head().rename({'day': 'total_days_played'}, axis=1)
Q9

Unnamed: 0,castaway_id,full_name,total_days_played
0,55,Rob Mariano,131
1,197,Parvati Shallow,130
2,201,Oscar Lusth,128
3,179,Cirie Fields,121
4,112,Sandra Diaz-Twine,110


**Q10A & Q10B**: Using the `castaway_details` data, what percentage of the players are extroverts and what percentage are introverts (count each player only once, even if they have played in more than one season)?  Do not count contestants without a personality type listed in your calculations.  Save these percentages as `Q10A` and `Q10B` respectively.  Note: Round all percentages to two decimal places and write as a float (example: 55.57). 

For more information on personality types check this [Wikipedia article](https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator).
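
Since the first letter of a Myers-Briggs type is E or I, one possible approach is to map that letter to a group and drop missing values first. The series below is hypothetical toy data:

```python
import pandas as pd

# Toy personality types (hypothetical), including one missing value
types = pd.Series(["ENFP", "ISTJ", None, "ESTP", "INFJ", "ENTJ"])

# Drop missing types, take the first letter, and map it to a group label
group = types.dropna().str[0].map({"E": "extrovert", "I": "introvert"})

# Share of each group as a percentage, rounded to two decimal places
pct = (group.value_counts(normalize=True) * 100).round(2)
print(pct)
```

Three of the five non-missing types start with E, so the split is 60 percent to 40 percent.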

In [22]:
castaway_details2 = castaway_details.copy()

In [23]:
castaway_details2.personality_type.value_counts(normalize=True)

ENFP    0.107261
ESFP    0.097360
ISFP    0.084158
ESTP    0.079208
ISTJ    0.077558
INFP    0.072607
ESTJ    0.064356
INTP    0.059406
ISFJ    0.057756
ENTP    0.054455
ESFJ    0.046205
ISTP    0.044554
ENFJ    0.044554
ENTJ    0.042904
INFJ    0.034653
INTJ    0.033003
Name: personality_type, dtype: float64

In [24]:
# match on the first letter only (E = extrovert, I = introvert) so re-running the cell cannot relabel rows
castaway_details2.loc[castaway_details2['personality_type'].str.startswith('E', na=False), 'personality_type'] = 'extroverts'
castaway_details2.loc[castaway_details2['personality_type'].str.startswith('I', na=False), 'personality_type'] = 'introverts'

In [25]:
(castaway_details2.personality_type.value_counts(normalize=True))*100

extroverts    53.630363
introverts    46.369637
Name: personality_type, dtype: float64

In [26]:
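# compute the value from the data instead of typing it in
Q10A = round(float(castaway_details2['personality_type'].value_counts(normalize=True)['extroverts'] * 100), 2)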
Q10A

53.63

In [27]:
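# compute the value from the data instead of typing it in
Q10B = round(float(castaway_details2['personality_type'].value_counts(normalize=True)['introverts'] * 100), 2)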
Q10B

46.37

**Q11A & Q11B**: Now that we know the percentages of total players that are extroverted and introverted, let's see if that made a difference in terms of who actually won their season.

What percentage of the winners are extroverts and what percentage are introverts (count each player only once, even if they have won more than one season)?  Save these percentages as `Q11A` and `Q11B` respectively.  Note: Round all percentages to two decimal places and write as a float (example: 55.57).

In [28]:
sole_survivor

Unnamed: 0,season_name,season,full_name,castaway_id,castaway,age,city,state,personality_type,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
0,Survivor: Borneo,1,Richard Hatch,16,Richard,39,Newport,Rhode Island,ENTP,14,39,16,Sole Survivor,,Tagi,,,Rattana,6,4
1,Survivor: The Australian Outback,2,Tina Wesson,32,Tina,40,Knoxville,Tennessee,ESFJ,16,42,16,Sole Survivor,,Ogakor,,,Barramundi,0,2
2,Survivor: Africa,3,Ethan Zohn,48,Ethan,27,Lexington,Massachusetts,ISFP,15,39,16,Sole Survivor,,Boran,Boran,,Moto Maji,0,4
3,Survivor: Marquesas,4,Vecepia Towery,64,Vecepia,36,Hayward,California,ISTJ,15,39,16,Sole Survivor,,Maraamu,Rotu,,Soliantu,2,4
4,Survivor: Thailand,5,Brian Heidik,80,Brian,34,Quartz Hill,California,ISTP,15,39,16,Sole Survivor,,Chuay Gahn,,,Chuay Jai,0,8
5,Survivor: The Amazon,6,Jenna Morasca,96,Jenna,21,Bridgeville,Pennsylvania,ISTP,15,39,16,Sole Survivor,,Jaburu,Jaburu,,Jacaré,3,7
6,Survivor: Pearl Islands,7,Sandra Diaz-Twine,112,Sandra,29,Fort Lewis,Washington,ESTP,15,39,18,Sole Survivor,,Drake,Drake,,Balboa,0,3
7,Survivor: All-Stars,8,Amber Brkich,27,Amber,25,Beaver,Pennsylvania,ISFP,17,39,18,Sole Survivor,,Chapera,Chapera,Chapera,Chaboga Mogo,6,6
8,Survivor: Vanuatu,9,Chris Daugherty,130,Chris,33,South Vienna,Ohio,ENTP,15,39,18,Sole Survivor,,Lopevi,Lopevi,,Alinta,3,6
9,Survivor: Palau,10,Tom Westman,150,Tom,40,Sayville,New York,ESTJ,15,39,20,Sole Survivor,,Koror,,,Koror,0,12


In [29]:
Q11_df = pd.merge(castaway_details2, sole_survivor, on='castaway_id')
Q11_df

Unnamed: 0,castaway_id,full_name_x,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type_x,season_name,season,full_name_y,castaway,age,city,state,personality_type_y,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
0,16,Richard Hatch,Richard,1961-04-08,NaT,Male,,,Corporate Trainer,extroverts,Survivor: Borneo,1,Richard Hatch,Richard,39,Newport,Rhode Island,ENTP,14,39,16,Sole Survivor,,Tagi,,,Rattana,6,4
1,27,Amber Mariano,Amber,1978-08-11,NaT,Female,,,Administrative Assistant;Director of Marketing...,introverts,Survivor: All-Stars,8,Amber Brkich,Amber,25,Beaver,Pennsylvania,ISFP,17,39,18,Sole Survivor,,Chapera,Chapera,Chapera,Chaboga Mogo,6,6
2,32,Tina Wesson,Tina,1960-12-26,NaT,Female,,,Personal Nurse;Motivational Speaker,extroverts,Survivor: The Australian Outback,2,Tina Wesson,Tina,40,Knoxville,Tennessee,ESFJ,16,42,16,Sole Survivor,,Ogakor,,,Barramundi,0,2
3,48,Ethan Zohn,Ethan,1973-11-12,NaT,Male,White,Jewish,Professional Soccer Player;Social Entrepreneur...,introverts,Survivor: Africa,3,Ethan Zohn,Ethan,27,Lexington,Massachusetts,ISFP,15,39,16,Sole Survivor,,Boran,Boran,,Moto Maji,0,4
4,55,Rob Mariano,Boston Rob,1975-12-25,NaT,Male,,,Construction Worker,extroverts,Survivor: Redemption Island,22,Rob Mariano,Boston Rob,34,Pensacola,Florida,ESTJ,15,39,20,Sole Survivor,,Ometepe,,,Murlonio,7,9
5,64,Vecepia Towery,Vecepia,1965-12-09,NaT,Female,Black,,Office Manager,introverts,Survivor: Marquesas,4,Vecepia Towery,Vecepia,36,Hayward,California,ISTJ,15,39,16,Sole Survivor,,Maraamu,Rotu,,Soliantu,2,4
6,80,Brian Heidik,Brian,1968-03-09,NaT,Male,,,Used Car Salesman,introverts,Survivor: Thailand,5,Brian Heidik,Brian,34,Quartz Hill,California,ISTP,15,39,16,Sole Survivor,,Chuay Gahn,,,Chuay Jai,0,8
7,96,Jenna Morasca,Jenna M.,1981-02-15,NaT,Female,,,Swimsuit Model,introverts,Survivor: The Amazon,6,Jenna Morasca,Jenna,21,Bridgeville,Pennsylvania,ISTP,15,39,16,Sole Survivor,,Jaburu,Jaburu,,Jacaré,3,7
8,112,Sandra Diaz-Twine,Sandra,1974-07-30,NaT,Female,,Hispanic or Latino,Office Assistant;Case Manager,extroverts,Survivor: Pearl Islands,7,Sandra Diaz-Twine,Sandra,29,Fort Lewis,Washington,ESTP,15,39,18,Sole Survivor,,Drake,Drake,,Balboa,0,3
9,112,Sandra Diaz-Twine,Sandra,1974-07-30,NaT,Female,,Hispanic or Latino,Office Assistant;Case Manager,extroverts,Survivor: Heroes vs. Villains,20,Sandra Diaz-Twine,Sandra,35,Fayetteville,North Carolina,ESTP,15,39,20,Sole Survivor,,Villains,,,Yin Yang,3,4


In [30]:
Q11_df = Q11_df.drop_duplicates(subset=['full_name_x'])
Q11_df

Unnamed: 0,castaway_id,full_name_x,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type_x,season_name,season,full_name_y,castaway,age,city,state,personality_type_y,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
0,16,Richard Hatch,Richard,1961-04-08,NaT,Male,,,Corporate Trainer,extroverts,Survivor: Borneo,1,Richard Hatch,Richard,39,Newport,Rhode Island,ENTP,14,39,16,Sole Survivor,,Tagi,,,Rattana,6,4
1,27,Amber Mariano,Amber,1978-08-11,NaT,Female,,,Administrative Assistant;Director of Marketing...,introverts,Survivor: All-Stars,8,Amber Brkich,Amber,25,Beaver,Pennsylvania,ISFP,17,39,18,Sole Survivor,,Chapera,Chapera,Chapera,Chaboga Mogo,6,6
2,32,Tina Wesson,Tina,1960-12-26,NaT,Female,,,Personal Nurse;Motivational Speaker,extroverts,Survivor: The Australian Outback,2,Tina Wesson,Tina,40,Knoxville,Tennessee,ESFJ,16,42,16,Sole Survivor,,Ogakor,,,Barramundi,0,2
3,48,Ethan Zohn,Ethan,1973-11-12,NaT,Male,White,Jewish,Professional Soccer Player;Social Entrepreneur...,introverts,Survivor: Africa,3,Ethan Zohn,Ethan,27,Lexington,Massachusetts,ISFP,15,39,16,Sole Survivor,,Boran,Boran,,Moto Maji,0,4
4,55,Rob Mariano,Boston Rob,1975-12-25,NaT,Male,,,Construction Worker,extroverts,Survivor: Redemption Island,22,Rob Mariano,Boston Rob,34,Pensacola,Florida,ESTJ,15,39,20,Sole Survivor,,Ometepe,,,Murlonio,7,9
5,64,Vecepia Towery,Vecepia,1965-12-09,NaT,Female,Black,,Office Manager,introverts,Survivor: Marquesas,4,Vecepia Towery,Vecepia,36,Hayward,California,ISTJ,15,39,16,Sole Survivor,,Maraamu,Rotu,,Soliantu,2,4
6,80,Brian Heidik,Brian,1968-03-09,NaT,Male,,,Used Car Salesman,introverts,Survivor: Thailand,5,Brian Heidik,Brian,34,Quartz Hill,California,ISTP,15,39,16,Sole Survivor,,Chuay Gahn,,,Chuay Jai,0,8
7,96,Jenna Morasca,Jenna M.,1981-02-15,NaT,Female,,,Swimsuit Model,introverts,Survivor: The Amazon,6,Jenna Morasca,Jenna,21,Bridgeville,Pennsylvania,ISTP,15,39,16,Sole Survivor,,Jaburu,Jaburu,,Jacaré,3,7
8,112,Sandra Diaz-Twine,Sandra,1974-07-30,NaT,Female,,Hispanic or Latino,Office Assistant;Case Manager,extroverts,Survivor: Pearl Islands,7,Sandra Diaz-Twine,Sandra,29,Fort Lewis,Washington,ESTP,15,39,18,Sole Survivor,,Drake,Drake,,Balboa,0,3
10,130,Chris Daugherty,Chris,1970-08-29,NaT,Male,,,Highway Construction Worker,extroverts,Survivor: Vanuatu,9,Chris Daugherty,Chris,33,South Vienna,Ohio,ENTP,15,39,18,Sole Survivor,,Lopevi,Lopevi,,Alinta,3,6


In [31]:
(Q11_df.personality_type_x.value_counts(normalize=True))*100

extroverts    61.538462
introverts    38.461538
Name: personality_type_x, dtype: float64

In [32]:
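# compute the value from the data instead of typing it in
Q11A = round(float(Q11_df['personality_type_x'].value_counts(normalize=True)['extroverts'] * 100), 2)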
Q11A

61.54

In [33]:
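# compute the value from the data instead of typing it in
Q11B = round(float(Q11_df['personality_type_x'].value_counts(normalize=True)['introverts'] * 100), 2)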
Q11B

38.46

**Q12:** Which contestants have never received a tribal council vote (i.e. a vote to be voted out of the game as shown in the `vote_id` column in the `vote_history` DataFrame)? Note that there are various reasons for a contestant to not receive a tribal vote: they quit, made it to the end, medical emergency, etc.  Select their rows from the `castaway_details` DataFrame and save this as `Q12` in ascending order by `castaway_id`.  This should return a DataFrame and the index and missing values should be left as is.
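
Finding rows in one table whose ID never appears in a column of another table is an anti-join, which can be sketched with `isin` and a negated mask. The frames below are hypothetical toy data that only mirror the column names:

```python
import pandas as pd

# Toy tables (hypothetical); vote_id holds the ID of the person voted FOR
toy_details = pd.DataFrame({
    "castaway_id": [1, 2, 3, 4],
    "full_name": ["A", "B", "C", "D"],
})
toy_votes = pd.DataFrame({
    "castaway_id": [1, 2, 3],             # who cast the vote
    "vote_id": [2.0, 1.0, float("nan")],  # who received it (NaN = no vote cast)
})

# IDs that received at least one vote
voted_ids = toy_votes["vote_id"].dropna().unique()

# Keep contestants whose ID is NOT among them
never_voted = toy_details[~toy_details["castaway_id"].isin(voted_ids)]
never_voted = never_voted.sort_values(by="castaway_id")
print(never_voted)
```

Selecting from the details table directly keeps one row per contestant and preserves the original index.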

In [126]:
Q12_df = pd.merge(castaway_details, vote_history, on='castaway_id')
Q12_df

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type,season_name,season,episode,day,tribe_status,castaway,immunity,vote,nullified,voted_out,order,vote_order,vote_id,voted_out_id
0,1,Sonja Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP,Survivor: Borneo,1,1,3,Original,Sonja,,Rudy,False,Sonja,1,1,14.0,1.0
1,2,B.B. Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ,Survivor: Borneo,1,2,6,Original,B.B.,,Ramona,False,B.B.,2,1,4.0,2.0
2,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ,Survivor: Borneo,1,1,3,Original,Stacey,,Rudy,False,Sonja,1,1,14.0,1.0
3,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ,Survivor: Borneo,1,3,9,Original,Stacey,,Rudy,False,Stacey,3,1,14.0,3.0
4,4,Ramona Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ,Survivor: Borneo,1,2,6,Original,Ramona,,B.B.,False,B.B.,2,1,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4746,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,8,16,Merged,Liana,,Tiffany,False,Tiffany,8,1,604.0,604.0
4747,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,9,17,Merged,Liana,,Evvie,False,Evvie,10,1,598.0,598.0
4748,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,10,19,Merged,Liana,,Erika,False,Tie,11,1,594.0,
4749,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,10,19,Merged,Liana,,,False,Shan,11,1,,606.0


In [127]:
# IDs of every contestant who received at least one tribal council vote
voted_ids = vote_history['vote_id'].dropna().unique()

In [133]:
# keep the contestants whose ID never appears in `vote_id`
Q12 = castaway_details[~castaway_details['castaway_id'].isin(voted_ids)].sort_values(by='castaway_id')

In [134]:
Q12


In [132]:
idx = castaway_details.columns
idx

Index(['castaway_id', 'full_name', 'short_name', 'date_of_birth',
       'date_of_death', 'gender', 'race', 'ethnicity', 'occupation',
       'personality_type'],
      dtype='object')

**Q13:** What contestant has won the most number of challenges?  Select their row from the `castaway_details` DataFrame and save this as `Q13`.  This should return a DataFrame and the index and missing values should be left as is.
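
Counting wins per contestant can be sketched with `value_counts` on the winner ID, followed by a lookup in the details table. The toy data is hypothetical, and the sketch assumes that an `outcome_status` of `'Winner'` is what marks a win:

```python
import pandas as pd

# Toy tables (hypothetical) mirroring the relevant column names
toy_results = pd.DataFrame({
    "winner_id": [2.0, 4.0, 4.0, 7.0, 4.0, 4.0],
    "outcome_status": ["Winner", "Winner", "Winner", "Winner",
                       "Chosen to participate", "Winner"],
})
toy_details = pd.DataFrame({
    "castaway_id": [2, 4, 7],
    "full_name": ["A", "B", "C"],
})

# Count only rows marked as wins, then take the ID with the most wins
wins = toy_results.loc[toy_results["outcome_status"] == "Winner", "winner_id"].value_counts()
top_id = int(wins.idxmax())

# Select that contestant's row from the details table
top_row = toy_details[toy_details["castaway_id"] == top_id]
print(top_row)
```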

In [39]:
challenge_results

Unnamed: 0,season_name,season,episode,day,episode_title,challenge_name,challenge_type,outcome_type,challenge_id,winner_id,winner,winning_tribe,outcome_status
0,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,2.0,B.B.,Pagong,Winner
1,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,4.0,Ramona,Pagong,Winner
2,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,6.0,Joel,Pagong,Winner
3,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,7.0,Gretchen,Pagong,Winner
4,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,8.0,Greg,Pagong,Winner
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4436,Survivor: 41,41,12,23,Truth Kamikaze,Dizzy Miss Lizzy,Immunity,Individual,CH1015,596.0,Ricard,Via Kana,Winner
4437,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Reward,Individual,CH0954,594.0,Erika,Via Kana,Winner
4438,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Reward,Individual,CH0954,593.0,Heather,Via Kana,Chosen to participate
4439,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Immunity,Individual,CH0954,594.0,Erika,Via Kana,Winner


In [40]:
challenge_results_winner = challenge_results[challenge_results['outcome_status'] == 'Winner']
challenge_results_winner

Unnamed: 0,season_name,season,episode,day,episode_title,challenge_name,challenge_type,outcome_type,challenge_id,winner_id,winner,winning_tribe,outcome_status
0,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,2.0,B.B.,Pagong,Winner
1,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,4.0,Ramona,Pagong,Winner
2,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,6.0,Joel,Pagong,Winner
3,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,7.0,Gretchen,Pagong,Winner
4,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,8.0,Greg,Pagong,Winner
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4435,Survivor: 41,41,12,22,Truth Kamikaze,Beach Front Bay Bash,Reward,Team,CH1017,594.0,Erika,Via Kana,Winner
4436,Survivor: 41,41,12,23,Truth Kamikaze,Dizzy Miss Lizzy,Immunity,Individual,CH1015,596.0,Ricard,Via Kana,Winner
4437,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Reward,Individual,CH0954,594.0,Erika,Via Kana,Winner
4439,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Immunity,Individual,CH0954,594.0,Erika,Via Kana,Winner


In [41]:
challenge_results_winner.groupby('winner_id')['outcome_status'].count().sort_values(ascending=False).head(1)

winner_id
201.0    42
Name: outcome_status, dtype: int64

In [42]:
challenge_results[challenge_results['winner_id'] == 201.0]

Unnamed: 0,season_name,season,episode,day,episode_title,challenge_name,challenge_type,outcome_type,challenge_id,winner_id,winner,winning_tribe,outcome_status
1187,Survivor: Cook Islands,13,1,3,I Can Forgive Her But I Don't Have To Because ...,"Lock, Load and Light",Immunity,Tribal,CH0517,201.0,Ozzy,Aitutaki,Winner
1192,Survivor: Cook Islands,13,1,3,I Can Forgive Her But I Don't Have To Because ...,"Lock, Load and Light",Immunity,Tribal,CH0517,201.0,Ozzy,Roratonga,Winner
1232,Survivor: Cook Islands,13,4,11,Ruling the Roost,Rescue Mission,Immunity,Tribal,CH0021,201.0,Ozzy,Aitutaki,Winner
1240,Survivor: Cook Islands,13,4,11,Ruling the Roost,Sacrificial Lamb,Reward,Tribal,CH0406,201.0,Ozzy,Aitutaki,Winner
1248,Survivor: Cook Islands,13,5,14,Don't Cry Over Spilled Octopus,United We Stand,Immunity,Tribal,CH0520,201.0,Ozzy,Aitutaki,Winner
1264,Survivor: Cook Islands,13,6,15,Plan Voodoo,Kicking and Screaming,Reward,Tribal,CH0599,201.0,Ozzy,Aitutaki,Winner
1277,Survivor: Cook Islands,13,8,18,Why Aren't You Swimming?!,Smash and Grab,Reward,Tribal,CH0524,201.0,Ozzy,Aitutaki,Winner
1281,Survivor: Cook Islands,13,9,21,Mutiny,Depth Charge,Immunity,Tribal,CH0536,201.0,Ozzy,Aitutaki,Winner
1285,Survivor: Cook Islands,13,9,21,Mutiny,Barrel of Monkeys,Reward,Tribal,CH0531,201.0,Ozzy,Aitutaki,Winner
1289,Survivor: Cook Islands,13,10,24,People That You Like Want To See You Suffer,South Pacific,Immunity,Tribal,CH0537,201.0,Ozzy,Aitutaki,Winner


In [43]:
# Select by id (201 is the winner_id found above); short names are not guaranteed to be unique
Q13 = castaway_details[castaway_details['castaway_id'] == 201]
Q13

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
200,201,Oscar Lusth,Ozzy,1981-08-23,NaT,Male,Mexican American,Hispanic or Latino,Waiter;Photographer,ISFP
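
A minimal sketch of the same approach on made-up data: count the `Winner` rows per `winner_id`, take the id with the most wins, then select that contestant's row by id. The toy frames, names, and values below are invented for illustration and are not part of the Survivor data.

```python
import pandas as pd

# Toy stand-ins for challenge_results and castaway_details (values are made up).
toy_results = pd.DataFrame({
    "winner_id": [1.0, 2.0, 2.0, 3.0, 2.0, 3.0, 2.0],
    "outcome_status": ["Winner", "Winner", "Winner", "Winner",
                       "Chosen to participate", "Winner", "Winner"],
})
toy_details = pd.DataFrame({
    "castaway_id": [1, 2, 3],
    "short_name": ["Ann", "Bo", "Cy"],
})

# Keep only actual wins, then count rows per contestant.
is_win = toy_results["outcome_status"] == "Winner"
wins = toy_results.loc[is_win, "winner_id"].value_counts()

# idxmax returns the id with the highest count; cast because the id is stored as a float.
top_id = int(wins.idxmax())

# Boolean selection keeps the result a one-row DataFrame with its original index.
top_row = toy_details[toy_details["castaway_id"] == top_id]
print(top_row)
```

Looking the row up by id avoids accidental matches that a substring search on a name could produce.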


**Q14:** Let's see if the use of hidden immunity idols has increased or decreased over the seasons.   Create a Series of the number of hidden idols held per season.  The season number should be the index and the values should be the sum of the number of idols that were held.  Save this as `Q14`, sorted by season in ascending order.

In [115]:
hidden_idols2 = hidden_idols.set_index('season')
groupby_season = hidden_idols2.groupby('season')
Q14 = groupby_season.idols_held.sum().sort_index()
Q14

season
11     1
12     1
13     1
14     4
15     3
16     4
17     4
18     3
19     4
20    11
21     4
22     3
23     2
24     3
25     3
26     7
27     3
28     6
29     5
30     3
31     4
32     5
33     7
34     8
35    10
36     9
37     7
38     8
39    13
40    10
Name: idols_held, dtype: int64

**Q15:** Which contestant held the most hidden immunity idols in a single season?  Select their row from the `castaway_details` DataFrame and save this as `Q15`.  This should return a DataFrame and the index and missing values should be left as is.

In [45]:
hidden_idols2

Unnamed: 0_level_0,season_name,castaway_id,castaway,idol_number,idols_held,votes_nullified,day_found,day_played,legacy_advantage
season,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
11,Survivor: Guatemala,161,Gary,1,1,0.0,24.0,,False
12,Survivor: Panama,180,Terry,1,1,0.0,,,False
13,Survivor: Cook Islands,202,Yul,1,1,0.0,,,False
14,Survivor: Fiji,218,Yau-Man,1,1,4.0,17.0,36.0,False
14,Survivor: Fiji,214,Mookie,1,1,0.0,20.0,,False
...,...,...,...,...,...,...,...,...,...
40,Survivor: Winners at War,478,Michele,1,1,2.0,22.0,31.0,False
40,Survivor: Winners at War,424,Tony,1,1,0.0,28.0,36.0,False
40,Survivor: Winners at War,516,Ben,1,1,2.0,29.0,36.0,False
40,Survivor: Winners at War,442,Natalie,1,1,4.0,35.0,36.0,False


In [46]:
# sum() takes no column name; its first argument is numeric_only, so state that explicitly
hidden_idols.groupby(['castaway_id', 'season']).sum(numeric_only=True).sort_values(by='idols_held', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,idol_number,idols_held,votes_nullified,day_found,day_played,legacy_advantage
castaway_id,season,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
560,38,10,4,8.0,117.0,127.0,0
424,28,6,3,0.0,63.0,55.0,0
578,39,6,3,5.0,47.0,19.0,0
516,35,6,3,9.0,100.0,106.0,0
300,19,6,3,7.0,51.0,45.0,0
...,...,...,...,...,...,...,...
347,23,1,0,0.0,0.0,0.0,0
370,24,1,0,0.0,2.0,0.0,0
246,26,1,0,0.0,31.0,0.0,0
181,20,1,0,0.0,0.0,0.0,0


In [47]:
Q15 = castaway_details[castaway_details['castaway_id'] == 560]
Q15

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
559,560,Rick Devens,Rick,1984-04-05,NaT,Male,,,Morning News Anchor,ENTP
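
A minimal sketch of the per-season tally on made-up data. Selecting the `idols_held` column before summing keeps the result to a single Series, and `idxmax` on the two-level index returns the `(castaway_id, season)` pair directly. The toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for hidden_idols: one row per idol (values are made up).
toy_idols = pd.DataFrame({
    "season": [1, 1, 1, 2, 2],
    "castaway_id": [10, 10, 11, 10, 12],
    "idols_held": [1, 1, 1, 1, 1],
})

# Total idols per contestant within each season.
per_player_season = toy_idols.groupby(["castaway_id", "season"])["idols_held"].sum()

# With a two-level index, idxmax returns a (castaway_id, season) tuple.
best_id, best_season = per_player_season.idxmax()
print(best_id, best_season, per_player_season.max())
```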


**Q16:** What was the largest number of days between when a hidden immunity idol was found and when it was played?  Don't count instances with missing values in the `day_found` or `day_played` column.  Save the largest number of days as `Q16` (as an int).

In [48]:
hidden_idols

Unnamed: 0,season_name,season,castaway_id,castaway,idol_number,idols_held,votes_nullified,day_found,day_played,legacy_advantage
0,Survivor: Guatemala,11,161,Gary,1,1,0.0,24.0,,False
1,Survivor: Panama,12,180,Terry,1,1,0.0,,,False
2,Survivor: Cook Islands,13,202,Yul,1,1,0.0,,,False
3,Survivor: Fiji,14,218,Yau-Man,1,1,4.0,17.0,36.0,False
4,Survivor: Fiji,14,214,Mookie,1,1,0.0,20.0,,False
...,...,...,...,...,...,...,...,...,...,...
154,Survivor: Winners at War,40,478,Michele,1,1,2.0,22.0,31.0,False
155,Survivor: Winners at War,40,424,Tony,1,1,0.0,28.0,36.0,False
156,Survivor: Winners at War,40,516,Ben,1,1,2.0,29.0,36.0,False
157,Survivor: Winners at War,40,442,Natalie,1,1,4.0,35.0,36.0,False


In [49]:
# Keep the difference in its own Series so the original hidden_idols DataFrame is not modified
day_diff = (hidden_idols['day_played'] - hidden_idols['day_found']).rename('day_diff')
day_diff.sort_values(ascending=False)

113    33.0
76     33.0
128    32.0
94     31.0
95     27.0
       ... 
141     NaN
142     NaN
148     NaN
149     NaN
152     NaN
Name: day_diff, Length: 159, dtype: float64

In [50]:
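# max() skips NaN, so rows with a missing day_found or day_played are ignored
Q16 = int((hidden_idols['day_played'] - hidden_idols['day_found']).max())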
Q16

33
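
A minimal sketch of the calculation on made-up data: subtracting the two columns gives `NaN` wherever either day is missing, and `max()` skips `NaN` by default, so no explicit filtering is needed. The toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hidden_idols (values are made up).
toy_idols = pd.DataFrame({
    "day_found": [24.0, np.nan, 17.0, 3.0],
    "day_played": [np.nan, np.nan, 36.0, 10.0],
})

# The difference is NaN for any row where either day is missing.
gap = toy_idols["day_played"] - toy_idols["day_found"]

# max() ignores NaN; int() converts the float result to a plain Python int.
largest_gap = int(gap.max())
print(largest_gap)  # 19
```

Computing the gap as a separate Series leaves the source DataFrame unchanged.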

**Q17:** Let's see how many winners ended up getting unanimous jury votes to win the game.  Create a DataFrame that shows the survivors that got unanimous jury votes with these columns in the final output: `season`, `season_name`, `winner_id`, `full_name`.  The DataFrame should be sorted by season and the index should go from 0 to n-1.  Save this as `Q17`.

In [51]:
challenge_results

Unnamed: 0,season_name,season,episode,day,episode_title,challenge_name,challenge_type,outcome_type,challenge_id,winner_id,winner,winning_tribe,outcome_status
0,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,2.0,B.B.,Pagong,Winner
1,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,4.0,Ramona,Pagong,Winner
2,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,6.0,Joel,Pagong,Winner
3,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,7.0,Gretchen,Pagong,Winner
4,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,8.0,Greg,Pagong,Winner
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4436,Survivor: 41,41,12,23,Truth Kamikaze,Dizzy Miss Lizzy,Immunity,Individual,CH1015,596.0,Ricard,Via Kana,Winner
4437,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Reward,Individual,CH0954,594.0,Erika,Via Kana,Winner
4438,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Reward,Individual,CH0954,593.0,Heather,Via Kana,Chosen to participate
4439,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Immunity,Individual,CH0954,594.0,Erika,Via Kana,Winner


In [52]:
jury_votes

Unnamed: 0,season_name,season,castaway,finalist,vote,castaway_id,finalist_id
0,Survivor: 41,41,Heather,Deshawn,0,593,601
1,Survivor: 41,41,Ricard,Deshawn,0,596,601
2,Survivor: 41,41,Danny,Deshawn,1,599,601
3,Survivor: 41,41,Liana,Deshawn,0,608,601
4,Survivor: 41,41,Shan,Deshawn,0,606,601
...,...,...,...,...,...,...,...
928,Survivor: Borneo,1,Greg,Richard,1,8,16
929,Survivor: Borneo,1,Jenna,Richard,0,9,16
930,Survivor: Borneo,1,Rudy,Richard,1,14,16
931,Survivor: Borneo,1,Sean,Richard,1,12,16


In [53]:
g = jury_votes.groupby(['season', 'finalist_id'])[['castaway_id']].count()
g

Unnamed: 0_level_0,Unnamed: 1_level_0,castaway_id
season,finalist_id,Unnamed: 2_level_1
1,15,7
1,16,7
2,31,7
2,32,7
3,47,7
...,...,...
40,442,16
40,478,16
41,594,8
41,597,8


In [54]:
j = jury_votes.groupby(['season', 'finalist_id' ])[['vote']].sum()
j

Unnamed: 0_level_0,Unnamed: 1_level_0,vote
season,finalist_id,Unnamed: 2_level_1
1,15,3
1,16,4
2,31,3
2,32,4
3,47,2
...,...,...
40,442,4
40,478,0
41,594,7
41,597,0


In [55]:
merged = pd.merge(j,g, left_on='finalist_id', right_on='finalist_id')
merged = merged.rename({'vote': 'summed_vote',
              'castaway_id': 'castaway_id_count'}, axis=1)
merged

Unnamed: 0_level_0,summed_vote,castaway_id_count
finalist_id,Unnamed: 1_level_1,Unnamed: 2_level_1
15,3,7
16,4,7
31,3,7
32,4,7
47,2,7
...,...,...
589,2,10
590,8,10
594,7,8
597,0,8


In [56]:
#finalists with unanimous votes, count of the voters = sum of the votes 
merged[merged.summed_vote == merged.castaway_id_count]

Unnamed: 0_level_0,summed_vote,castaway_id_count
finalist_id,Unnamed: 1_level_1,Unnamed: 2_level_1
221,9,9
281,7,7
348,8,8
433,10,10
498,10,10


In [57]:
jury_votes

Unnamed: 0,season_name,season,castaway,finalist,vote,castaway_id,finalist_id
0,Survivor: 41,41,Heather,Deshawn,0,593,601
1,Survivor: 41,41,Ricard,Deshawn,0,596,601
2,Survivor: 41,41,Danny,Deshawn,1,599,601
3,Survivor: 41,41,Liana,Deshawn,0,608,601
4,Survivor: 41,41,Shan,Deshawn,0,606,601
...,...,...,...,...,...,...,...
928,Survivor: Borneo,1,Greg,Richard,1,8,16
929,Survivor: Borneo,1,Jenna,Richard,0,9,16
930,Survivor: Borneo,1,Rudy,Richard,1,14,16
931,Survivor: Borneo,1,Sean,Richard,1,12,16


In [58]:
challenge_results

Unnamed: 0,season_name,season,episode,day,episode_title,challenge_name,challenge_type,outcome_type,challenge_id,winner_id,winner,winning_tribe,outcome_status
0,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,2.0,B.B.,Pagong,Winner
1,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,4.0,Ramona,Pagong,Winner
2,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,6.0,Joel,Pagong,Winner
3,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,7.0,Gretchen,Pagong,Winner
4,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,8.0,Greg,Pagong,Winner
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4436,Survivor: 41,41,12,23,Truth Kamikaze,Dizzy Miss Lizzy,Immunity,Individual,CH1015,596.0,Ricard,Via Kana,Winner
4437,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Reward,Individual,CH0954,594.0,Erika,Via Kana,Winner
4438,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Reward,Individual,CH0954,593.0,Heather,Via Kana,Chosen to participate
4439,Survivor: 41,41,13,24,One Thing Left to do… Win,Gimme Three Steps,Immunity,Individual,CH0954,594.0,Erika,Via Kana,Winner


In [59]:
castaway_details

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
0,1,Sonja Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP
1,2,B.B. Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ
2,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ
3,4,Ramona Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ
4,5,Dirk Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,ISFP
...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,Tiffany,1973-12-08,NaT,Female,White,Jewish,Teacher,ENTP
604,605,Sydney Segal,Sydney,1995-07-19,NaT,Female,White,Jewish,Law Student,ESTP
605,606,Shantel Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,ENFJ
606,607,David Voce,Voce,1986-05-01,NaT,Male,,,Neurosurgeon,ENTJ


In [60]:
df_17 = pd.merge(castaway_details, jury_votes, left_on='castaway_id', right_on='castaway_id')
df_17

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type,season_name,season,castaway,finalist,vote,finalist_id
0,8,Greg Buis,Greg,1975-12-31,NaT,Male,,,Ivy League Graduate,INTP,Survivor: Borneo,1,Greg,Kelly,0,15
1,8,Greg Buis,Greg,1975-12-31,NaT,Male,,,Ivy League Graduate,INTP,Survivor: Borneo,1,Greg,Richard,1,16
2,9,Jenna Lewis,Jenna L.,1977-07-16,NaT,Female,,,Student,ENFP,Survivor: All-Stars,8,Jenna L.,Amber,0,27
3,9,Jenna Lewis,Jenna L.,1977-07-16,NaT,Female,,,Student,ENFP,Survivor: All-Stars,8,Jenna L.,Boston Rob,1,55
4,9,Jenna Lewis,Jenna L.,1977-07-16,NaT,Female,,,Student,ENFP,Survivor: Borneo,1,Jenna,Kelly,1,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
928,606,Shantel Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,ENFJ,Survivor: 41,41,Shan,Erika,1,594
929,606,Shantel Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,ENFJ,Survivor: 41,41,Shan,Xander,0,597
930,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,Liana,Deshawn,0,601
931,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,Liana,Erika,1,594


In [61]:
df_17 = pd.merge(df_17, challenge_results, left_on='short_name', right_on='winner')
df_17

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type,season_name_x,season_x,castaway,finalist,vote,finalist_id,season_name_y,season_y,episode,day,episode_title,challenge_name,challenge_type,outcome_type,challenge_id,winner_id,winner,winning_tribe,outcome_status
0,8,Greg Buis,Greg,1975-12-31,NaT,Male,,,Ivy League Graduate,INTP,Survivor: Borneo,1,Greg,Kelly,0,15,Survivor: Borneo,1,1,3,The Marooning,Quest for Fire,Reward and Immunity,Tribal,CH0001,8.0,Greg,Pagong,Winner
1,8,Greg Buis,Greg,1975-12-31,NaT,Male,,,Ivy League Graduate,INTP,Survivor: Borneo,1,Greg,Kelly,0,15,Survivor: Borneo,1,2,6,The Generation Gap,Shoulder the Load,Reward,Tribal,CH0357,8.0,Greg,Pagong,Winner
2,8,Greg Buis,Greg,1975-12-31,NaT,Male,,,Ivy League Graduate,INTP,Survivor: Borneo,1,Greg,Kelly,0,15,Survivor: Borneo,1,3,9,Quest for Food,Rescue Mission,Immunity,Tribal,CH0021,8.0,Greg,Pagong,Winner
3,8,Greg Buis,Greg,1975-12-31,NaT,Male,,,Ivy League Graduate,INTP,Survivor: Borneo,1,Greg,Kelly,0,15,Survivor: Borneo,1,5,15,Pulling Your Own Weight,Shipwrecked,Immunity,Tribal,CH0034,8.0,Greg,Pagong,Winner
4,8,Greg Buis,Greg,1975-12-31,NaT,Male,,,Ivy League Graduate,INTP,Survivor: Borneo,1,Greg,Kelly,0,15,Survivor: Borneo,1,5,15,Pulling Your Own Weight,Choose Your Weapon,Reward,Tribal,CH0029,8.0,Greg,Pagong,Winner
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15432,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,Liana,Xander,0,597,Survivor: 41,41,4,9,They Hate Me Because They Ain't Me,Kenny Log-Ins,Immunity,Tribal,CH0982,608.0,Liana,Yase,Winner
15433,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,Liana,Xander,0,597,Survivor: 41,41,4,9,They Hate Me Because They Ain't Me,Running Down a Dream,Reward,Tribal,CH1005,608.0,Liana,Yase,Winner
15434,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,Liana,Xander,0,597,Survivor: 41,41,5,11,The Strategist or the Loyalist,Losing Face,Reward and Immunity,Tribal,CH0765,608.0,Liana,Yase,Winner
15435,608,Liana Wallace,Liana,2000-10-25,NaT,Female,Black,Jewish,College Student,ESTJ,Survivor: 41,41,Liana,Xander,0,597,Survivor: 41,41,7,14,There's Gonna Be Blood,The Game is Afoot,Immunity,Tribal,CH0902,608.0,Liana,,Winner


In [62]:
# A finalist won unanimously when every juror in that season voted for them.
# Group by season as well as finalist_id so a returning finalist's seasons are tallied separately.
tally = jury_votes.groupby(['season', 'season_name', 'finalist_id'])['vote'].agg(['sum', 'count']).reset_index()
unanimous = tally[tally['sum'] == tally['count']]

# Join names on finalist_id: the finalist is the winner, while castaway_id in jury_votes is the juror.
df_17 = pd.merge(unanimous, castaway_details[['castaway_id', 'full_name']], left_on='finalist_id', right_on='castaway_id')
df_17 = df_17.rename({'finalist_id': 'winner_id'}, axis=1)

In [63]:
Q17 = df_17[['season', 'season_name', 'winner_id', 'full_name']].sort_values(by='season').reset_index(drop=True)
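
A minimal sketch of the unanimous-vote check on made-up data. Grouping by `season` as well as `finalist_id` keeps a returning finalist's seasons separate, and names are joined on `finalist_id` because `castaway_id` in `jury_votes` identifies the juror, not the finalist. The toy frames, the names in them, and the `votes_for` and `jurors` column names are invented for illustration.

```python
import pandas as pd

# Made-up jury votes: finalist 7 reaches the final in both season 1 and season 2.
toy_votes = pd.DataFrame({
    "season":      [1, 1, 1, 1, 2, 2, 2, 2],
    "season_name": ["S1"] * 4 + ["S2"] * 4,
    "finalist_id": [7, 7, 8, 8, 7, 7, 9, 9],
    "vote":        [1, 0, 0, 1, 1, 1, 0, 0],
})
toy_details = pd.DataFrame({
    "castaway_id": [7, 8, 9],
    "full_name": ["Pat Doe", "Sam Roe", "Lee Poe"],
})

# Votes received and number of jurors, per finalist per season.
tally = (toy_votes
         .groupby(["season", "season_name", "finalist_id"])["vote"]
         .agg(votes_for="sum", jurors="count")
         .reset_index())

# Unanimous means every juror voted for the finalist.
unanimous = tally[tally["votes_for"] == tally["jurors"]]

# Attach the finalist's name, then shape the output columns and index.
result = (unanimous
          .merge(toy_details, left_on="finalist_id", right_on="castaway_id")
          .rename(columns={"finalist_id": "winner_id"})
          [["season", "season_name", "winner_id", "full_name"]]
          .sort_values("season")
          .reset_index(drop=True))
print(result)
```

In this toy data finalist 7 is unanimous only in season 2, which a tally keyed on `finalist_id` alone could not distinguish from season 1.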


**Q18:** Sometimes a contestant might win the game even though they have a lot of other contestants trying to eliminate them.  Which survivor that won their season had the most votes cast against them during the season (recorded as `total_votes_received` in the `sole_survivor` DataFrame)?  Select their row from the `castaway_details` DataFrame and save this as `Q18`. This should return a DataFrame and the index and missing values should be left as is.

In [64]:
sole_survivor.head()

Unnamed: 0,season_name,season,full_name,castaway_id,castaway,age,city,state,personality_type,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
0,Survivor: Borneo,1,Richard Hatch,16,Richard,39,Newport,Rhode Island,ENTP,14,39,16,Sole Survivor,,Tagi,,,Rattana,6,4
1,Survivor: The Australian Outback,2,Tina Wesson,32,Tina,40,Knoxville,Tennessee,ESFJ,16,42,16,Sole Survivor,,Ogakor,,,Barramundi,0,2
2,Survivor: Africa,3,Ethan Zohn,48,Ethan,27,Lexington,Massachusetts,ISFP,15,39,16,Sole Survivor,,Boran,Boran,,Moto Maji,0,4
3,Survivor: Marquesas,4,Vecepia Towery,64,Vecepia,36,Hayward,California,ISTJ,15,39,16,Sole Survivor,,Maraamu,Rotu,,Soliantu,2,4
4,Survivor: Thailand,5,Brian Heidik,80,Brian,34,Quartz Hill,California,ISTP,15,39,16,Sole Survivor,,Chuay Gahn,,,Chuay Jai,0,8


In [65]:
sole_survivor.groupby(['castaway_id', 'season_name'])[['total_votes_received']].sum().sort_values(by = 'total_votes_received', ascending=False).head(1)


Unnamed: 0_level_0,Unnamed: 1_level_0,total_votes_received
castaway_id,season_name,Unnamed: 2_level_1
516,Survivor: Heroes vs. Healers vs. Hustlers,11


In [66]:
Q18 = castaway_details[castaway_details['castaway_id'] == 516]
Q18

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
515,516,Ben Driebergen,Ben,1983-01-01,NaT,Male,,,Marine;Real Estate/Stay-at-home Dad,ESFP


**Q19:** For the `castaway_details` DataFrame, there is a `full_name` column and a `short_name` column.  It would be helpful for future analysis to have the contestants' first and last names split into separate columns.  First copy the `castaway_details` DataFrame to a new DataFrame called `Q19` so that we do not change the original DataFrame.

Create two new columns and add the contestant's first name to a new column called `first_name` and their last name to a new column called `last_name`.  Add these columns to the `Q19` DataFrame.  Put the `first_name` and `last_name` columns between the `full_name` and `short_name` columns.

Note:  Be careful as some players have last names with multiple spaces.  For example, `Lex van den Berghe`.  You should code `Lex` as his first name and `van den Berghe` as his last name.

In [67]:
Q19 = castaway_details.copy()
Q19.head()

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
0,1,Sonja Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP
1,2,B.B. Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ
2,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ
3,4,Ramona Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ
4,5,Dirk Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,ISFP


In [68]:
# n=1 splits on the first space only, so multi-word last names stay together
name = Q19['full_name'].str.split(expand=True, n=1)
name

Unnamed: 0,0,1
0,Sonja,Christopher
1,B.B.,Andersen
2,Stacey,Stillman
3,Ramona,Gray
4,Dirk,Been
...,...,...
603,Tiffany,Seely
604,Sydney,Segal
605,Shantel,Smith
606,David,Voce


In [69]:
name[0]

0        Sonja
1         B.B.
2       Stacey
3       Ramona
4         Dirk
        ...   
603    Tiffany
604     Sydney
605    Shantel
606      David
607      Liana
Name: 0, Length: 608, dtype: object

In [70]:
Q19['first_name'] = name[0]
Q19['last_name'] = name[1]

In [71]:
Q19 = Q19[['castaway_id', 'full_name', 'first_name', 'last_name', 'short_name', 'date_of_birth', 'date_of_death', 'gender', 'race', 'ethnicity', 'occupation', 'personality_type']]
Q19

Unnamed: 0,castaway_id,full_name,first_name,last_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
0,1,Sonja Christopher,Sonja,Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP
1,2,B.B. Andersen,B.B.,Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ
2,3,Stacey Stillman,Stacey,Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ
3,4,Ramona Gray,Ramona,Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ
4,5,Dirk Been,Dirk,Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,ISFP
...,...,...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,Tiffany,Seely,Tiffany,1973-12-08,NaT,Female,White,Jewish,Teacher,ENTP
604,605,Sydney Segal,Sydney,Segal,Sydney,1995-07-19,NaT,Female,White,Jewish,Law Student,ESTP
605,606,Shantel Smith,Shantel,Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,ENFJ
606,607,David Voce,David,Voce,Voce,1986-05-01,NaT,Male,,,Neurosurgeon,ENTJ


In [72]:
#checking
Q19[Q19['first_name'] == 'Lex']

Unnamed: 0,castaway_id,full_name,first_name,last_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
45,46,Lex van den Berghe,Lex,van den Berghe,Lex,1963-06-18,NaT,Male,,,Marketing Manager,ENTJ
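
A minimal sketch of the split on made-up rows, using `DataFrame.insert` as one alternative way to position the new columns without retyping every column name. The two-row toy frame is for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "full_name": ["Sonja Christopher", "Lex van den Berghe"],
    "short_name": ["Sonja", "Lex"],
})

# n=1 splits on the first whitespace only, so multi-word surnames stay whole.
parts = toy["full_name"].str.split(n=1, expand=True)

# insert() places each new column at a chosen position, right after full_name.
pos = toy.columns.get_loc("full_name") + 1
toy.insert(pos, "first_name", parts[0])
toy.insert(pos + 1, "last_name", parts[1])
print(toy.columns.tolist())
```

Note that `insert` modifies the DataFrame in place, so it should only be used on a copy of the original data.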


**Q20:** Let's say that we wanted to predict a contestant's personality type based on the information in the data files.  Your task is to create a DataFrame that lists the `castaway_id`, `full_name` and `personality_type` for each castaway contestant.  However, since most machine learning algorithms use numeric data, you want to change the personality types to the following numbers:
- ISTJ - 1
- ISTP - 2
- ISFJ - 3
- ISFP - 4
- INFJ - 5
- INFP - 6
- INTJ - 7
- INTP - 8
- ESTP - 9
- ESTJ - 10
- ESFP - 11
- ESFJ - 12
- ENFP - 13
- ENFJ - 14
- ENTP - 15
- ENTJ - 16
- Missing values - 17

Save this new DataFrame as `Q20` and sort based on `castaway_id` in ascending order.

In [73]:
Q20_df = castaway_details.copy()

In [74]:
Q20_df[['castaway_id','full_name', 'personality_type' ]]

Unnamed: 0,castaway_id,full_name,personality_type
0,1,Sonja Christopher,ENFP
1,2,B.B. Andersen,ESTJ
2,3,Stacey Stillman,ENTJ
3,4,Ramona Gray,ISTJ
4,5,Dirk Been,ISFP
...,...,...,...
603,604,Tiffany Seely,ENTP
604,605,Sydney Segal,ESTP
605,606,Shantel Smith,ENFJ
606,607,David Voce,ENTJ


In [75]:
Q20_df[['personality_type']] = castaway_details[['personality_type']].replace({
    'ISTJ': 1,
    'ISTP': 2,
    'ISFJ': 3,
    'ISFP': 4,
    'INFJ': 5,
    'INFP': 6,
    'INTJ': 7,
    'INTP': 8,
    'ESTP': 9,
    'ESTJ': 10,
    'ESFP': 11,
    'ESFJ': 12,
    'ENFP': 13,
    'ENFJ': 14,
    'ENTP': 15,
    'ENTJ': 16,
})

In [76]:
Q20_df[['personality_type']] = Q20_df[['personality_type']].fillna(17)
Q20_df[['personality_type']] = Q20_df[['personality_type']].astype(int).astype('category')

In [77]:
Q20 = Q20_df[['castaway_id','full_name', 'personality_type' ]].sort_values(by='castaway_id')
Q20

Unnamed: 0,castaway_id,full_name,personality_type
0,1,Sonja Christopher,13
1,2,B.B. Andersen,10
2,3,Stacey Stillman,16
3,4,Ramona Gray,1
4,5,Dirk Been,4
...,...,...,...
603,604,Tiffany Seely,15
604,605,Sydney Segal,9
605,606,Shantel Smith,14
606,607,David Voce,16
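
A minimal sketch of the same recoding with `Series.map`, building the code table from the ordered list in the prompt instead of typing each pair. One difference worth noting as an assumption of this sketch: `map` turns any value that is not in the table into `NaN`, so an unexpected string would also end up coded as 17. The toy Series is invented for illustration.

```python
import numpy as np
import pandas as pd

# Personality types in the order given in the prompt; enumerate assigns codes 1..16.
types = ["ISTJ", "ISTP", "ISFJ", "ISFP", "INFJ", "INFP", "INTJ", "INTP",
         "ESTP", "ESTJ", "ESFP", "ESFJ", "ENFP", "ENFJ", "ENTP", "ENTJ"]
codes = {t: i for i, t in enumerate(types, start=1)}

toy = pd.Series(["ENFP", "ESTJ", np.nan, "ISTJ"], name="personality_type")

# map() leaves missing (and unmatched) values as NaN, which fillna then codes as 17.
coded = toy.map(codes).fillna(17).astype(int)
print(coded.tolist())  # [13, 10, 17, 1]
```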


**Q21:** After thinking about this some more, you realize that you don't want to code the personality traits as you did in problem 20 since the data is not ordinal.  Some machine learning algorithms will assume that numbers close to each other are more alike than those that are far apart, and that is not the case with these personality types.

Let's create a new DataFrame called `Q21` that creates dummy columns (using `get_dummies`) for the original personality type column.  Add a prefix called "type" and drop the first column to help prevent multicollinearity.  The columns should be `castaway_id`, `full_name` followed by the various dummy columns for the personality types.  Don't worry about any missing values in this step.

Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment.  Make sure you use `copy()` if needed.

In [78]:
Q21_df = castaway_details.copy()
Q21_df

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
0,1,Sonja Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP
1,2,B.B. Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ
2,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ
3,4,Ramona Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ
4,5,Dirk Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,ISFP
...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,Tiffany,1973-12-08,NaT,Female,White,Jewish,Teacher,ENTP
604,605,Sydney Segal,Sydney,1995-07-19,NaT,Female,White,Jewish,Law Student,ESTP
605,606,Shantel Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,ENFJ
606,607,David Voce,Voce,1986-05-01,NaT,Male,,,Neurosurgeon,ENTJ


In [79]:
dummies = pd.get_dummies(Q21_df, columns=['personality_type'], prefix='type', drop_first=True)
dummies

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,type_ENFP,type_ENTJ,type_ENTP,type_ESFJ,type_ESFP,type_ESTJ,type_ESTP,type_INFJ,type_INFP,type_INTJ,type_INTP,type_ISFJ,type_ISFP,type_ISTJ,type_ISTP
0,1,Sonja Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,B.B. Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,Ramona Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,5,Dirk Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,Tiffany,1973-12-08,NaT,Female,White,Jewish,Teacher,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
604,605,Sydney Segal,Sydney,1995-07-19,NaT,Female,White,Jewish,Law Student,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
605,606,Shantel Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
606,607,David Voce,Voce,1986-05-01,NaT,Male,,,Neurosurgeon,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [80]:
Q21 = dummies.drop(['short_name', 'date_of_birth', 'date_of_death', 'gender', 'race', 'ethnicity','occupation'], axis=1)
Q21

Unnamed: 0,castaway_id,full_name,type_ENFP,type_ENTJ,type_ENTP,type_ESFJ,type_ESFP,type_ESTJ,type_ESTP,type_INFJ,type_INFP,type_INTJ,type_INTP,type_ISFJ,type_ISFP,type_ISTJ,type_ISTP
0,1,Sonja Christopher,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,B.B. Andersen,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,3,Stacey Stillman,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,Ramona Gray,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,5,Dirk Been,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
604,605,Sydney Segal,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
605,606,Shantel Smith,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
606,607,David Voce,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
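
A minimal sketch of `get_dummies` with `drop_first=True` on made-up rows. It illustrates one consequence of dropping the first category: the dropped type (alphabetically first, here `ENFJ`) and a missing type both show up as all zeros across the dummy columns. The toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "castaway_id": [1, 2, 3, 4],
    "full_name": ["A One", "B Two", "C Three", "D Four"],
    "personality_type": ["ENFP", "ENFJ", "ISTJ", np.nan],
})

# Only personality_type is encoded; the other columns pass through unchanged.
dummies = pd.get_dummies(toy, columns=["personality_type"],
                         prefix="type", drop_first=True)
print(dummies.columns.tolist())

# Rows for the dropped baseline (ENFJ) and for the missing value have no dummy set.
dummy_cols = ["type_ENFP", "type_ISTJ"]
baseline_all_zero = not dummies.loc[1, dummy_cols].any()
missing_all_zero = not dummies.loc[3, dummy_cols].any()
print(baseline_all_zero, missing_all_zero)
```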


**Q22:** After running your data above through your machine learning model, you determine that a better prediction might come from breaking the personality type into its four parts (one part for each character in the type).  Your task is now to create a DataFrame called `Q22` that splits the personality type into the various parts and creates a new column for each part.  The columns should be called `interaction` for the first letter in the personality type, `information` for the second letter, `decision` for the third, and `organization` for the fourth.

Again, since most machine learning algorithms work with numeric data, perform the following on the four new columns:
- `interaction` --> code all `I`'s as `0` and `E`'s as `1`
- `information` --> code all `S`'s as `0` and `N`'s as `1`
- `decision` --> code all `T`'s as `0` and `F`'s as `1`
- `organization` --> code all `J`'s as `0` and `P`'s as `1`
- Any missing values should be coded with a `2`

For example, if a contestant's personality type was `ENTJ`, your columns for that row would be:
- `1` for `interaction` because of the `E`
- `1` for `information` because of the `N`
- `0` for `decision` because of the `T` 
- `0` for `organization` because of the `J`

The new DataFrame should be sorted in `castaway_id` order and have the following columns in this order: `castaway_id`, `full_name`, `personality_type`, `interaction`, `information`, `decision`, `organization`.

Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment.  Make sure you use `copy()` if needed.

In [81]:
castaway_details

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
0,1,Sonja Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP
1,2,B.B. Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ
2,3,Stacey Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ
3,4,Ramona Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ
4,5,Dirk Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,ISFP
...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,Tiffany,1973-12-08,NaT,Female,White,Jewish,Teacher,ENTP
604,605,Sydney Segal,Sydney,1995-07-19,NaT,Female,White,Jewish,Law Student,ESTP
605,606,Shantel Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,ENFJ
606,607,David Voce,Voce,1986-05-01,NaT,Male,,,Neurosurgeon,ENTJ


**Q23:** Using data from `castaways`, create a DataFrame called `Q23` that bins the contestant ages (their age when they were on the season, not their current age) into the following age categories:
- 18-24
- 25-34
- 35-44
- 45-54
- 55-64
- 65+

The final DataFrame should have the following columns in this order: `season`, `castaway_id`, `full_name`, `age`, and `age_category`.  The DataFrame should be sorted by `age` and then `castaway_id`.  The index should be 0 through n-1.  You should have the same number of rows as in the `castaways` DataFrame.

Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment.  Make sure you use `copy()` if needed.

In [92]:
castaways.age.max()

75

In [91]:
bins = [17,24,34,44,54,64,76]  # pd.cut intervals are right-closed, so the left edge must be 17 to include age 18

In [101]:
# Select the needed columns and copy so that castaways itself is never modified
Q23_df = castaways[['season', 'castaway_id', 'full_name', 'age']].copy()
Q23_df.shape

(762, 4)

In [122]:
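# pd.cut uses right-closed intervals: (17, 24] covers ages 18-24, and 76 covers the maximum age of 75
bins = [17, 24, 34, 44, 54, 64, 76]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']

Q23_df['age_category'] = pd.cut(Q23_df['age'], bins=bins, labels=labels)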

In [123]:
Q23_df['age_category']

0      25-34
1      25-34
2      18-24
3      45-54
4      25-34
       ...  
757    18-24
758    25-34
759    25-34
760    55-64
761    55-64
Name: age_category, Length: 762, dtype: category
Categories (6, object): ['18-24' < '25-34' < '35-44' < '45-54' < '55-64' < '65+']
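
The bin edges start at 17 rather than 18 because `pd.cut` treats intervals as right-closed by default.  The sketch below, using a handful of made-up ages rather than the real data, shows the difference and one alternative that should give the same result (`right=False` with shifted edges).  The variable names are illustrative only.

```python
import pandas as pd

ages = pd.Series([18, 24, 25, 64, 65, 75])  # toy ages, not the real data
labels = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

# Left edge of 18: the first interval is (18, 24], so age 18 falls outside and becomes NaN.
too_narrow = pd.cut(ages, bins=[18, 24, 34, 44, 54, 64, 76], labels=labels)

# Left edge of 17: the first interval is (17, 24], so age 18 is included.
shifted = pd.cut(ages, bins=[17, 24, 34, 44, 54, 64, 76], labels=labels)

# Alternative: left-closed intervals [18, 25), [25, 35), ... with right=False.
left_closed = pd.cut(ages, bins=[18, 25, 35, 45, 55, 65, 76], right=False, labels=labels)

print(too_narrow.isna().tolist())
print(shifted.tolist())
print(left_closed.tolist())
```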

In [124]:
Q23 = Q23_df.sort_values(by=['age', 'castaway_id']).reset_index(drop=True)
Q23

Unnamed: 0,season,castaway_id,full_name,age,age_category
0,33,491,Will Wahl,18,18-24
1,36,528,Michael Yerger,18,18-24
2,18,270,Spencer Duhm,19,18-24
3,22,336,Natalie Tenerelli,19,18-24
4,23,350,Brandon Hantz,19,18-24
...,...,...,...,...,...
757,24,366,Greg Smith,64,55-64
758,21,304,Jimmy Johnson,67,65+
759,32,474,Joseph Del Campo,71,65+
760,1,14,Rudy Boesch,72,65+


**Q24:** Based on the age categories you created above, use `value_counts()` to find the normalized percentages (the proportion of contestants) for each age category.  Sort the value counts by index.  Save this as `Q24`.

In [125]:
Q24 = Q23_df['age_category'].value_counts(normalize=True).sort_index()
Q24

18-24    0.190289
25-34    0.437008
35-44    0.211286
45-54    0.124672
55-64    0.031496
65+      0.005249
Name: age_category, dtype: float64

**Q25:** Which contestant(s) played a perfect game?  A contestant is considered to have played a perfect game if they:
- didn't receive any tribal council votes all season
- won the game
- got unanimous jury votes (see question 17)

Save this DataFrame as `Q25` with the following columns: `season_name`, `season`, `castaway_id`, `full_name`, `tribal_council_votes`, `jury_votes`.  The DataFrame should be sorted by season and the index should be 0 to n-1.

In [110]:
jury_votes.head()

Unnamed: 0,season_name,season,castaway,finalist,vote,castaway_id,finalist_id
0,Survivor: 41,41,Heather,Deshawn,0,593,601
1,Survivor: 41,41,Ricard,Deshawn,0,596,601
2,Survivor: 41,41,Danny,Deshawn,1,599,601
3,Survivor: 41,41,Liana,Deshawn,0,608,601
4,Survivor: 41,41,Shan,Deshawn,0,606,601
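
Judging from the preview, `jury_votes` appears to hold one row per juror/finalist pair, with `vote` equal to `1` when the juror voted for that finalist.  Under that assumption, the unanimous-jury condition could be checked roughly as follows.  This is only a sketch on made-up toy rows, and the names `toy_jury`, `totals`, `jurors`, and `unanimous` are illustrative; the other two conditions (winning the game and receiving no tribal council votes) depend on tables not shown here and are left out.

```python
import pandas as pd

# Toy jury votes (hypothetical values) using column names seen in jury_votes.
toy_jury = pd.DataFrame({
    "season": [1, 1, 1, 1, 2, 2, 2, 2],
    "finalist_id": [10, 10, 11, 11, 20, 20, 21, 21],
    "vote": [1, 1, 0, 0, 1, 0, 0, 1],
})

# For each finalist, total the votes received and count the jurors who voted.
totals = (
    toy_jury.groupby(["season", "finalist_id"])["vote"]
    .agg(jury_votes="sum", jurors="count")
    .reset_index()
)

# Unanimous means every juror in that season voted for the same finalist.
unanimous = totals[totals["jury_votes"] == totals["jurors"]]
print(unanimous)
```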
