|  |  |
| - | - |
| ![swlego](swlego.png) | The introduction of Lego's first licensed series, Star Wars, was a hit that sparked a series of collaborations with more themed sets. _The partnerships team has asked you to perform an analysis of this success, and before diving into the analysis, they have suggested reading the descriptions of the two datasets to use, reported below._ | 

> ### Please note:
> _This was very much a working notebook, in which I've tried to apply my understanding of basic pandas to the tasks. A few snippets are highly inefficient, non-Pythonic or downright incorrect. That said, I wanted to include these snippets for my own retrospective learning._

## The Data

You have been provided with two datasets to use. A summary and preview are provided below.

## lego_sets.csv

| Column     | Description              |
|------------|--------------------------|
| `"set_num"` | A code that is unique to each set in the dataset. This column is critical, and a missing value indicates the set is a duplicate or invalid! |
| `"name"` | The name of the set. |
| `"year"` | The date the set was released. |
| `"num_parts"` | The number of parts contained in the set. This column is not central to our analyses, so missing values are acceptable. |
| `"theme_name"` | The name of the sub-theme of the set. |
| `"parent_theme"` | The name of the parent theme the set belongs to. Matches the `name` column of the parent_themes csv file. |

## parent_themes.csv

| Column     | Description              |
|------------|--------------------------|
| `"id"` | A code that is unique to every theme. |
| `"name"` | The name of the parent theme. |
| `"is_licensed"` | A Boolean column specifying whether the theme is a licensed theme. |

## Task
- What percentage of all licensed sets ever released were Star Wars themed? Save your answer as a variable the_force, as an integer.

- In which year was the highest number of Star Wars sets released? Save your answer as a variable new_era, as an integer (e.g. 2012).


In [1]:
import pandas as pd

lego_sets = pd.read_csv('data/lego_sets.csv')
lego_sets.head()

Unnamed: 0,set_num,name,year,num_parts,theme_name,parent_theme
0,00-1,Weetabix Castle,1970,471.0,Castle,Legoland
1,0011-2,Town Mini-Figures,1978,,Supplemental,Town
2,0011-3,Castle 2 for 1 Bonus Offer,1987,,Lion Knights,Castle
3,0012-1,Space Mini-Figures,1979,12.0,Supplemental,Space
4,0013-1,Space Mini-Figures,1979,12.0,Supplemental,Space


In [2]:
parent_themes = pd.read_csv('data/parent_themes.csv')
parent_themes.head()

Unnamed: 0,id,name,is_licensed
0,1,Technic,False
1,22,Creator,False
2,50,Town,False
3,112,Racers,False
4,126,Space,False
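
The data description says a missing `set_num` marks a set as a duplicate or invalid, so one option is to drop those rows before counting anything. A minimal sketch on a made-up frame (the `toy_sets` values and the `valid_sets` name are illustrative assumptions, not taken from the real CSV):

```python
import pandas as pd

# Toy stand-in for lego_sets.csv; the rows are invented for illustration
toy_sets = pd.DataFrame({
    "set_num": ["00-1", None, "0012-1", None],
    "name": ["Weetabix Castle", "Unknown A", "Space Mini-Figures", "Unknown B"],
    "parent_theme": ["Legoland", "Town", "Space", "Star Wars"],
})

# Drop only the rows where set_num is missing; NaN in other columns is tolerated
valid_sets = toy_sets.dropna(subset=["set_num"])

print(len(toy_sets), "rows before,", len(valid_sets), "rows after")
```

Using `subset=` matters here: a plain `dropna()` would also discard rows whose only gap is `num_parts`, which the description says is acceptable.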


### What percentage of all licensed sets ever released were Star Wars themed? 
_Save your answer as a variable the_force, as an integer._

In [3]:
# firstly I will merge the datasets, joining lego_sets['parent_theme'] to parent_themes['name']
merged_df = lego_sets.merge(parent_themes, left_on='parent_theme', right_on='name')
# expecting '4 juniors' to be the first parent theme if I order values from A-Z...
merged_df.sort_values(by='parent_theme',ascending=True).head()

Unnamed: 0,set_num,name_x,year,num_parts,theme_name,parent_theme,id,name_y,is_licensed
9905,4619-1,A.I.R. Patrol Jet,2002,,Airport,4 Juniors,279,4 Juniors,False
9909,4651-1,Police Motorcycle,2003,12.0,Police,4 Juniors,279,4 Juniors,False
9908,4622-1,Res-Q Digger,2002,67.0,Traffic,4 Juniors,279,4 Juniors,False
9907,4621-1,Jack Stone Red Flash Station,2002,32.0,Fire,4 Juniors,279,4 Juniors,False
9906,4620-1,A.I.R. Operations HQ,2002,,Airport,4 Juniors,279,4 Juniors,False
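
`merge` defaults to an inner join, which silently drops any set whose `parent_theme` has no match in the themes table. A hedged sketch of how that could be checked, using toy frames (`toy_sets`, `toy_themes` and the "Mystery Theme" row are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the two CSV files; values are invented
toy_sets = pd.DataFrame({
    "set_num": ["a-1", "b-1", "c-1"],
    "parent_theme": ["Star Wars", "Town", "Mystery Theme"],
})
toy_themes = pd.DataFrame({
    "id": [158, 50],
    "name": ["Star Wars", "Town"],
    "is_licensed": [True, False],
})

# how="left" keeps every set, indicator=True adds a "_merge" column saying
# where each row came from, and validate raises if a theme name is duplicated
checked = toy_sets.merge(
    toy_themes,
    left_on="parent_theme",
    right_on="name",
    how="left",
    indicator=True,
    validate="many_to_one",
)

# Rows flagged "left_only" found no matching parent theme
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["set_num", "parent_theme"]])
```

If `unmatched` comes back empty on the real data, the inner join used above loses nothing.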


In [4]:
# count the rows that have a set_num (value_counts().sum() counts non-null rows, not unique values)
total_set_count = merged_df['set_num'].value_counts().sum()
total_set_count

11833
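
For my own reference, a small sketch of how the different counting calls behave when a column holds both a duplicate and a missing value. The `codes` series is a made-up example, not data from the CSV:

```python
import pandas as pd

# One duplicate ("b-1") and one missing value
codes = pd.Series(["a-1", "b-1", "b-1", None], name="set_num")

non_null_rows = int(codes.value_counts().sum())  # NaN is dropped, duplicates still counted
same_as_count = int(codes.count())               # the direct way to count non-null rows
unique_codes = int(codes.nunique())              # distinct non-null values
all_rows = len(codes)                            # every row, NaN included

print(non_null_rows, same_as_count, unique_codes, all_rows)  # 3 3 2 4
```

The first two only agree with `nunique()` when the column has no duplicates.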

In [5]:
# get a total count of all the rows that feature 'Star Wars' as the parent_theme
star_wars_themed = merged_df['parent_theme'].str.contains('Star Wars', na=False).sum()
star_wars_themed

609

In [6]:
# What percentage (integer) of all licensed sets ever released were Star Wars themed?
the_force = (star_wars_themed / total_set_count * 100)
the_force

5.1466238485591145

This doesn't look right, even as a _float..._

On review of the dataset, I now realise that **not all Star Wars 'sub-themed' LEGO sets sit under the parent_theme** Star Wars.

In [7]:
# get a total count of all the rows that feature 'Star Wars' in the theme_name
star_wars_themed = merged_df['theme_name'].str.contains('Star Wars', na=False).sum()
star_wars_themed

# What percentage (integer) of all licensed sets ever released were Star Wars themed?
the_force = (star_wars_themed / total_set_count * 100)
the_force

5.476210597481619

The result is still only around 5%, which can't be right... **_and this was my error_**: I misread the question.

The question asks for the percentage of all **_licensed_** sets only,
so I need a total count of sets that excludes every non-licensed one.

In [8]:
# double-checking that I'm dealing with boolean in the licensed column
print(merged_df['is_licensed'].apply(type).unique())

[<class 'bool'>]
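
A shorter way to run the same check is to look at the column's dtype rather than the type of every element. A minimal sketch on made-up series (`flags` and `mixed` are illustrative names):

```python
import pandas as pd
from pandas.api.types import is_bool_dtype

# A clean Boolean column reports dtype bool
flags = pd.Series([True, False, True], name="is_licensed")
print(flags.dtype)
print(is_bool_dtype(flags))

# A missing value among the Booleans makes pandas fall back to object dtype,
# which is the case worth catching before using the column as a mask
mixed = pd.Series([True, None, False])
print(mixed.dtype)
```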


In [9]:
# keep only the licensed sets (is_licensed is already Boolean, so it works directly as a mask)
licensed_only = merged_df[merged_df['is_licensed']]

# confirm the total number of unique sets in the licensed list
total_set_count = licensed_only['set_num'].nunique()
total_set_count

1179

In [10]:
# get a revised total count of all the rows that feature 'Star Wars' as the theme_name
star_wars_count = licensed_only['theme_name'].str.contains('Star Wars', na=False).sum()
star_wars_count

541

Let's try this again...

In [11]:
# What percentage (integer) of all licensed sets ever released were Star Wars themed?
the_force = int((star_wars_count / total_set_count) * 100)
the_force

45

As this was the first collaboration, I assumed the share would be high, and with the resurgence of Star Wars under a merchandise-focused Disney, 45% sounds plausible.
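
One thing worth double-checking is which column defines "Star Wars themed". Earlier, matching on `parent_theme` gave 609 rows, while matching on `theme_name` within the licensed sets gave 541, which suggests the two filters do not select the same rows. A sketch of how the two definitions can diverge, on a made-up frame (`licensed_toy` and its rows, including the "Clone Wars" sub-theme, are assumptions for illustration only):

```python
import pandas as pd

# Toy licensed sets; one Star Wars set has a sub-theme name without "Star Wars"
licensed_toy = pd.DataFrame({
    "set_num": ["s-1", "s-2", "s-3", "h-1"],
    "theme_name": ["Star Wars Episode 1", "Star Wars", "Clone Wars", "Harry Potter"],
    "parent_theme": ["Star Wars", "Star Wars", "Star Wars", "Harry Potter"],
})

# Definition 1: the parent theme is Star Wars
by_parent = int((licensed_toy["parent_theme"] == "Star Wars").sum())

# Definition 2: the sub-theme name mentions Star Wars
by_sub_theme = int(licensed_toy["theme_name"].str.contains("Star Wars", na=False).sum())

share_by_parent = int(by_parent / len(licensed_toy) * 100)
share_by_sub_theme = int(by_sub_theme / len(licensed_toy) * 100)
print(share_by_parent, share_by_sub_theme)  # 75 50
```

On the toy data the sub-theme filter undercounts; running both filters on `licensed_only` would show whether the same happens with the real sets.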

----------
### In which year was the highest number of Star Wars sets released? 
_Save your answer as a variable new_era, as an integer_

In [12]:
sw_released = licensed_only[licensed_only['parent_theme'] == 'Star Wars']
sw_released.head(10)

Unnamed: 0,set_num,name_x,year,num_parts,theme_name,parent_theme,id,name_y,is_licensed
3493,10018-1,Darth Maul,2001,1868.0,Star Wars,Star Wars,158,Star Wars,True
3494,10019-1,Rebel Blockade Runner - UCS,2001,,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True
3495,10026-1,Naboo Starfighter - UCS,2002,,Star Wars Episode 1,Star Wars,158,Star Wars,True
3496,10030-1,Imperial Star Destroyer - UCS,2002,3115.0,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True
3497,10123-1,Cloud City,2003,707.0,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True
3498,10129-1,Rebel Snowspeeder - UCS,2003,1456.0,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True
3499,10131-1,TIE Fighter Collection,2004,,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True
3500,10134-1,Y-wing Attack Starfighter - UCS,2004,,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True
3501,10143-1,Death Star II,2005,,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True
3502,10144-1,Sandcrawler,2005,1679.0,Star Wars Episode 4/5/6,Star Wars,158,Star Wars,True


In [13]:
# finding the top 5 years (value_counts already sorts in descending order)
new_era_topfive = sw_released['year'].value_counts().head(5)
new_era_topfive

2016    61
2015    58
2017    55
2014    45
2012    43
Name: year, dtype: int64

In [14]:
# saving the year value as an integer variable named 'new_era'
new_era = int(sw_released['year'].value_counts().idxmax())
new_era

2016

**2016** saw the highest number of Star Wars sets released, followed closely by neighbouring years. This makes sense considering [The Force Awakens](https://en.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens) was released in December 2015, ushering in a new era of Star Wars popularity.
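
A compact sketch of the "busiest year" lookup, including a check for ties, which `idxmax` would otherwise hide by returning only one label. The `years` series is made up for illustration:

```python
import pandas as pd

# Toy release years: 2016 appears three times, 2015 twice, 2017 once
years = pd.Series([2015, 2016, 2016, 2017, 2016, 2015], name="year")

counts = years.value_counts()

# idxmax returns the index label (the year) with the largest count
peak_year = int(counts.idxmax())

# List every year that shares the maximum, in case of a tie
tied = counts[counts == counts.max()].index.tolist()

print(peak_year, tied)  # 2016 [2016]
```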