In [1]:
# initializing otter-grader
import otter
grader = otter.Notebook()

# Data Exploration and Visualization

In this week's lab, your answers will not be primarily graded through your code, but through your written analysis. As a Data Scientist, programming is important, but adding analysis to complement your code and explaining what your code has unveiled are essential skills. This lab does not require statistical analysis, but it will require you to have a curious and analytical mind.

## The goals for this lab
* consider ethical implications of using available datasets and develop a framework of ethical questions to examine
* get a quick overview of the dataset using `info()` and `describe()`
* extract subsets of the dataframe by column and by rows that match a given expression
* learn the advantages of changing column types and how to do so with the `astype()` function
* use `value_counts()` to get the frequencies of values in a given column
* review working with `pd.Series` (indexing and extracting values)
* plot different types of bar charts, configure chart properties, including using `sort`
* synthesize the above material and engage your spirit of exploration to independently analyze the provided dataset and find additional insights.

**Reminder: this lab should be done individually.** You can discuss the big-picture ideas with others but the final code and analysis should be your own. You should not share your code or answers directly with other students; please complete your own work and keep it to yourself. **You are required to read and to abide by the policies listed on The Office of Student Conduct website**: https://studentconduct.sa.ucsb.edu/academic-integrity.

## Standard Imports

As always, import the modules we will need throughout this lab. We will also suppress a warning that has no bearing on this lab.

In [2]:
import csv
import pandas as pd
import altair as alt
pd.options.mode.chained_assignment = None

## Exploring the Data

In this week's lab, we will be exploring a dataset consisting of Kickstarter projects that have either reached their funding goals or not. To load our dataset, we will be using Pandas' standard `read_csv()` method to pull our text file into a Dataframe that we can manipulate. This dataset is found at [this link](https://www.kaggle.com/yashkantharia/kickstarter-campaigns/data).

In [3]:
df = pd.read_csv('ks-projects.csv')
df.head()

Unnamed: 0,id,name,currency,main_category,sub_category,launched_at,deadline,duration,goal_usd,city,state,country,blurb_length,name_length,status,start_month,end_month,start_Q,end_Q,usd_pledged
0,1687733153,Socks of Speed and Socks of Elvenkind,USD,games,Tabletop Games,2018-10-30 20:00:02,2018-11-15 17:59:00,16.0,2000.0,Menasha,WI,US,14,7,successful,10,11,Q4,Q4,6061.0
1,227936657,Power Punch Boot Camp: An All-Ages Graphic Novel,GBP,comics,Comic Books,2018-08-06 10:00:43,2018-09-05 10:00:43,30.0,3870.99771,Shepperton,England,GB,24,8,successful,8,9,Q3,Q3,3914.50512
2,454186436,"Live Printing with SX8: ""Squeegee Pulp Up""",USD,fashion,Apparel,2017-06-09 15:41:03,2017-07-09 15:41:03,30.0,1100.0,Manhattan,NY,US,21,7,successful,6,7,Q2,Q3,1110.0
3,629469071,Lost Dog Street Band's Next Album,USD,music,Country & Folk,2014-09-25 18:46:01,2014-11-10 06:00:00,45.0,3500.0,Nashville,TN,US,15,6,successful,9,11,Q3,Q4,4807.0
4,183973060,"Qto-X, a Tiny Lantern",USD,technology,Gadgets,2016-11-28 16:35:11,2017-01-27 16:35:11,60.0,30000.0,Troy,MI,US,15,4,successful,11,1,Q4,Q1,40368.0


### Question 0: Ethical Considerations

Before we even begin analyzing these data, we need to make sure that there are no obvious ethical issues with this dataset or with our usage of it.

See if you can figure out:

* Who collected this dataset and why?
* Can the usage or the analysis of this dataset cause any harm to those represented in the dataset? To others?
* Is there a license that tells others how to use the dataset and attribute its authors?
* Who or what is represented in the data? Is someone or something over-represented? Who or what _is not represented_ in the data?
* Are the values precise enough to answer the question of interest?
* Did the measurement process potentially distort the system under study?
* Are there other potential ethical issues?

Try to answer these questions by looking at the source of the dataset, the site that it is hosted on, the values that are stored in it, etc. Come back to this question after you are finished with the lab and see if there's anything else that you discovered that you can add here.

<!--
BEGIN QUESTION
name: q0
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

## Overview of the Data

Before we delve into the dataset with a more analytic viewpoint, it is always a good idea to look at what the dataset contains from a bird's-eye view. Besides the `head()` function used above, there are two important functions that summarize a dataframe.

First, `info()` gives us an overview of how many columns there are, what types each column contains, and how many rows we have total.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192548 entries, 0 to 192547
Data columns (total 20 columns):
id               192548 non-null int64
name             192548 non-null object
currency         192548 non-null object
main_category    192548 non-null object
sub_category     192548 non-null object
launched_at      192548 non-null object
deadline         192548 non-null object
duration         192548 non-null float64
goal_usd         192548 non-null float64
city             192548 non-null object
state            192548 non-null object
country          192548 non-null object
blurb_length     192548 non-null int64
name_length      192548 non-null int64
status           192548 non-null object
start_month      192548 non-null int64
end_month        192548 non-null int64
start_Q          192548 non-null object
end_Q            192548 non-null object
usd_pledged      192548 non-null float64
dtypes: float64(3), int64(5), object(12)
memory usage: 29.4+ MB


The `describe()` function gives us summary statistics of numerical data within our dataframe. Notice how it only describes the columns that contain `int64` or `float64` types. 

In [5]:
df.describe()

Unnamed: 0,id,duration,goal_usd,blurb_length,name_length,start_month,end_month,usd_pledged
count,192548.0,192548.0,192548.0,192548.0,192548.0,192548.0,192548.0,192548.0
mean,1072709000.0,32.362907,37049.9,18.938322,5.767897,6.512168,6.789845,13711.66
std,619481000.0,11.610338,1036236.0,4.976948,2.705431,3.32441,3.357369,90788.06
min,8624.0,1.0,0.01,1.0,1.0,1.0,1.0,0.0
25%,534859100.0,30.0,1500.0,16.0,4.0,4.0,4.0,149.0
50%,1074643000.0,30.0,5000.0,20.0,6.0,7.0,7.0,1783.668
75%,1607955000.0,34.0,12470.57,22.0,8.0,9.0,10.0,7146.154
max,2147476000.0,93.0,129033300.0,35.0,27.0,12.0,12.0,8596475.0


### Question 1

When we're looking at a dataset containing many columns, it is almost inevitable that not all columns are useful. In today's lab, we want to look at the following columns in the following order: 
1. "currency"
1. "main_category"
1. "sub_category"
1. "duration"
1. "goal_usd"
1. "country"
1. "blurb_length"
1. "name_length"
1. "status"

Recall what you did in last week's lab and select those columns and place them into a new dataframe.

<!--
BEGIN QUESTION
name: q1
manual: false
points: 5
gradescope: show
-->

In [6]:
df_cols = ['currency', "main_category", "sub_category", "duration", "goal_usd", "country",
          "blurb_length", "name_length", "status"]
df_new = df[df_cols]
df_new.head()

Unnamed: 0,currency,main_category,sub_category,duration,goal_usd,country,blurb_length,name_length,status
0,USD,games,Tabletop Games,16.0,2000.0,US,14,7,successful
1,GBP,comics,Comic Books,30.0,3870.99771,GB,24,8,successful
2,USD,fashion,Apparel,30.0,1100.0,US,21,7,successful
3,USD,music,Country & Folk,45.0,3500.0,US,15,6,successful
4,USD,technology,Gadgets,60.0,30000.0,US,15,4,successful


In [7]:
grader.check('q1')

Let's get a new overview of our dataframe using `info()`.

In [8]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192548 entries, 0 to 192547
Data columns (total 9 columns):
currency         192548 non-null object
main_category    192548 non-null object
sub_category     192548 non-null object
duration         192548 non-null float64
goal_usd         192548 non-null float64
country          192548 non-null object
blurb_length     192548 non-null int64
name_length      192548 non-null int64
status           192548 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 13.2+ MB


Display the summary statistics of the new dataframe.

In [9]:
df_new.describe()

Unnamed: 0,duration,goal_usd,blurb_length,name_length
count,192548.0,192548.0,192548.0,192548.0
mean,32.362907,37049.9,18.938322,5.767897
std,11.610338,1036236.0,4.976948,2.705431
min,1.0,0.01,1.0,1.0
25%,30.0,1500.0,16.0,4.0
50%,30.0,5000.0,20.0,6.0
75%,34.0,12470.57,22.0,8.0
max,93.0,129033300.0,35.0,27.0


### Cleaning our Dataframe's types

One thing you might have noticed in the info for these dataframes is that some columns are of type `object`. This is an artifact of how `pandas` loads data from CSV files. The `object` type is a generic fallback: when pandas loaded the CSV, it could not determine a more specific type for the column, e.g., whether it held strings, categories, times, or mixed types. Using a more specific pandas type than `object` lets us manipulate the dataframe in more ways than before.

We know by looking at the data that these columns should be categorical variables. How can we change these columns so we can more easily manipulate and analyze them?

We can cast the dataframe's columns to a categorical type using Pandas' `astype()` functionality.

In [10]:
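# Assign the column directly: in newer pandas versions, `.loc[:, col] = ...`
# may write into the existing column and leave its dtype unchanged
df_new["main_category"] = df_new["main_category"].astype('category')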
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192548 entries, 0 to 192547
Data columns (total 9 columns):
currency         192548 non-null object
main_category    192548 non-null category
sub_category     192548 non-null object
duration         192548 non-null float64
goal_usd         192548 non-null float64
country          192548 non-null object
blurb_length     192548 non-null int64
name_length      192548 non-null int64
status           192548 non-null object
dtypes: category(1), float64(2), int64(2), object(4)
memory usage: 11.9+ MB


Notice how the type of column `"main_category"` got changed to `"category"`.

Categorical variables have multiple benefits:
* We can define custom sort orders like Small < Medium < Large
* They are more efficient to group by
* Dataframes with categorical variables typically use less memory
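The first and last benefits can be seen on a small example. This is only a sketch on made-up data (`sizes` and `size_type` are hypothetical names, not part of the lab's dataset), using `pd.CategoricalDtype` to declare an explicit order:

```python
import pandas as pd

# Hypothetical toy column, not from the Kickstarter dataset
sizes = pd.Series(["Medium", "Small", "Large", "Small", "Medium"] * 1000)

# An ordered categorical type with the explicit order Small < Medium < Large
size_type = pd.CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)
sizes_cat = sizes.astype(size_type)

# Sorting follows the declared order rather than alphabetical order
sorted_unique = list(sizes_cat.sort_values().unique())

# Compare memory use of the two representations (deep=True counts string contents)
bytes_as_text = sizes.memory_usage(deep=True)
bytes_as_category = sizes_cat.memory_usage(deep=True)
print(sorted_unique, bytes_as_text, bytes_as_category)
```

The memory saving comes from storing each distinct label once and representing the rows as small integer codes.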

### Question 2

Change all the other columns in `df_new` that have the type `object` to a categorical variable. Because dataframes often contain many more columns with this pandas loading artifact, it is more efficient to modify these columns using a loop.

<!--
BEGIN QUESTION
name: q2
manual: false
points: 5
gradescope: show
-->

In [11]:
columns_to_change = ["currency", "sub_category", "country", "status"]

for col_name in columns_to_change:
    # Direct column assignment; `.loc[:, col] = ...` may keep the old dtype in newer pandas
    df_new[col_name] = df_new[col_name].astype('category')
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192548 entries, 0 to 192547
Data columns (total 9 columns):
currency         192548 non-null category
main_category    192548 non-null category
sub_category     192548 non-null category
duration         192548 non-null float64
goal_usd         192548 non-null float64
country          192548 non-null category
blurb_length     192548 non-null int64
name_length      192548 non-null int64
status           192548 non-null category
dtypes: category(5), float64(2), int64(2)
memory usage: 7.0 MB


In [12]:
grader.check('q2')

One additional benefit of doing these conversions is that categorical variables take up less memory than objects or strings. Notice how the size of the dataframe went from 11.9+ MB to 7.0 MB.

## Finding Faulty Rows

Most datasets contain rows with faulty data. Sometimes, we will see rows of data that contain a null value, signaling that some information is missing. Keeping these rows in our dataset can lead to faulty analysis.

When we run `df_new.info()`, the output shows us that all our columns are "non-null," meaning that there are no missing values. However, this doesn't mean that we don't have bad data in our dataset. Let's take a look at the description of numerical values again.

In [13]:
df_new.describe()

Unnamed: 0,duration,goal_usd,blurb_length,name_length
count,192548.0,192548.0,192548.0,192548.0
mean,32.362907,37049.9,18.938322,5.767897
std,11.610338,1036236.0,4.976948,2.705431
min,1.0,0.01,1.0,1.0
25%,30.0,1500.0,16.0,4.0
50%,30.0,5000.0,20.0,6.0
75%,34.0,12470.57,22.0,8.0
max,93.0,129033300.0,35.0,27.0


### Question 3a

Notice that the minimum value of `goal_usd` is `1.000000e-02`, which is \$0.01. If we take a look at a random project on Kickstarter, we know that a project's goal has no value after the decimal place (i.e. full dollar amounts). Filter the dataframe so that we only have projects that have `goal_usd >= 1`. 

<!--
BEGIN QUESTION
name: q3a
manual: false
points: 5
gradescope: show
-->

In [14]:
df_clean = df_new[df_new['goal_usd'] >= 1]
df_clean

Unnamed: 0,currency,main_category,sub_category,duration,goal_usd,country,blurb_length,name_length,status
0,USD,games,Tabletop Games,16.0,2000.00000,US,14,7,successful
1,GBP,comics,Comic Books,30.0,3870.99771,GB,24,8,successful
2,USD,fashion,Apparel,30.0,1100.00000,US,21,7,successful
3,USD,music,Country & Folk,45.0,3500.00000,US,15,6,successful
4,USD,technology,Gadgets,60.0,30000.00000,US,15,4,successful
...,...,...,...,...,...,...,...,...,...
192543,NOK,food,Drinks,60.0,57858.66500,NO,21,3,failed
192544,NOK,technology,Fabrication Tools,47.0,115717.33000,NO,18,2,failed
192545,USD,food,Drinks,30.0,30000.00000,US,13,3,failed
192546,USD,art,Conceptual Art,60.0,1200.00000,US,3,3,failed


In [15]:
grader.check("q3a")

## Word of Caution about Cleaning Data

Just because a set of rows in a dataset may seem faulty at first glance doesn't mean that they should be removed. In reality, there might be some underlying reason why you see certain values, so you should always be very careful when discarding any data. 

In Question 3a, we removed rows that had a funding goal of under a dollar. However, we overlooked a possible explanation for why those values exist in our dataset.
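One cautious pattern is to split the data with a single boolean mask and inspect the flagged rows before discarding them. The sketch below uses a small made-up dataframe (`toy` and `is_suspicious` are hypothetical names) rather than the lab's data:

```python
import pandas as pd

# Hypothetical toy data, not from the Kickstarter dataset
toy = pd.DataFrame({
    "currency": ["USD", "CAD", "USD", "AUD", "CAD"],
    "goal_usd": [2000.0, 0.76, 1100.0, 0.71, 0.76],
})

# Build the boolean mask once so both subsets come from the same condition
is_suspicious = toy["goal_usd"] < 1

kept = toy[~is_suspicious]
flagged = toy[is_suspicious]

# Look for a pattern in the flagged rows before deciding to drop them
flagged_by_currency = flagged["currency"].value_counts()
print(flagged_by_currency)
```

If the flagged rows cluster in particular groups, that is a sign the values may have an explanation other than bad data.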


### Question 3b

Is there any reason why the rows we removed from our dataframe should not have been removed? Let's go back to our `df_new` dataset to retrieve the rows that we removed and explore them. Then, answer the question in the markdown cell below.

<!--
BEGIN QUESTION
name: q3b
manual: false
points: 5
gradescope: show
-->

In [16]:
df_removed_projects = df_new[df_new['goal_usd'] < 1]
df_removed_projects

Unnamed: 0,currency,main_category,sub_category,duration,goal_usd,country,blurb_length,name_length,status
20299,JPY,comics,Comic Books,30.0,0.913204,JP,10,9,successful
21785,AUD,design,Graphic Design,60.0,0.713256,AU,22,2,successful
24891,DKK,music,Rock,30.0,0.755907,DK,25,6,successful
29804,AUD,games,Tabletop Games,8.0,0.713256,AU,20,4,successful
30936,CAD,film & video,Documentary,45.0,0.757089,CA,20,7,successful
32366,CAD,games,Video Games,31.0,0.757089,CA,19,1,successful
32623,CAD,music,Electronic Music,23.0,0.757089,CA,8,6,successful
34622,CAD,music,Country & Folk,58.0,0.757089,CA,24,3,successful
34873,AUD,design,Graphic Design,11.0,0.713256,AU,9,10,successful
38294,JPY,comics,Comic Books,30.0,0.909681,JP,10,9,successful


In [17]:
grader.check("q3b")

Why shouldn't we have removed most of those rows? What might explain why they ended up being "faulty"?

*Hint: One of the categorical variables in the dataset might provide an explanation.*

<!--
BEGIN QUESTION
name: q3b1
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

*Answer Here*

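Most of these rows are probably not faulty. All of the removed rows shown above use a non-USD currency (JPY, AUD, DKK, CAD), and `goal_usd` is the goal converted to US dollars. A very small goal in the original currency, such as what appears to be 1 CAD, converts to less than \$1 (about \$0.76), so these are likely legitimate projects with tiny goals rather than data errors.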

## Data Exploration

Now that our data has been cleaned to a certain extent, let's explore what other insights we can find. First, let's take a look at the different main categories represented in the dataset.

To get a count of all the categories present in a column, we first use the `value_counts()` function, which returns a `pd.Series` that contains the count of each category in a column.

In [18]:
main_category_counts = df_clean["main_category"].value_counts()
main_category_counts

music           25844
film & video    25619
technology      19590
art             19042
publishing      18676
food            14961
games           12563
fashion         10811
comics           8166
design           7717
photography      7346
crafts           6530
theater          6451
journalism       5287
dance            3924
Name: main_category, dtype: int64

Note that the result from `value_counts()` is conveniently returned in descending order of the counts. The `"music"` category seems to have the most projects in this dataset.

To visualize this result better, we will want to use a bar chart to show the counts of each category. `value_counts()` returns a `pd.Series`, so we need to convert it into a Dataframe for use in Altair. 

To get the list of categories that match up with the counts, we use `Series.axes` to get the label of each value in the series. `Series.axes` is a list that holds a single item, the index, so we select the first item (at index 0) to get the list of categories and then use `pd.Series.values` to retrieve the corresponding counts.

In [19]:
category_counts = pd.DataFrame({
    "cat": main_category_counts.axes[0],
    "count": main_category_counts.values
})
category_counts

Unnamed: 0,cat,count
0,music,25844
1,film & video,25619
2,technology,19590
3,art,19042
4,publishing,18676
5,food,14961
6,games,12563
7,fashion,10811
8,comics,8166
9,design,7717
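The same conversion can also be written without building the dictionary by hand. This is a sketch of one alternative on made-up data (`categories` is a hypothetical series), not the approach the lab requires:

```python
import pandas as pd

# Hypothetical toy column, not from the Kickstarter dataset
categories = pd.Series(["music", "art", "music", "games", "music", "art"])

counts = categories.value_counts()

# Index labels and values are the two halves of a Series
labels = list(counts.index)    # same labels as counts.axes[0]
values = list(counts.values)

# rename_axis names the index, reset_index turns it into a regular column
counts_df = counts.rename_axis("cat").reset_index(name="count")
print(counts_df)
```

Here `counts.index` is the same object that `counts.axes[0]` returns, so either spelling gives the category labels.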


We can now use the resulting dataframe to create a bar chart using `mark_bar()`. Let's encode the category `"cat"` values as the x-axis and the corresponding counts as the y-axis.

In [20]:
alt.Chart(category_counts).mark_bar().encode(
    x = alt.X("cat", axis = alt.Axis(title = "Category")),
    y = alt.Y("count", axis = alt.Axis(title = "Count"))
)

Notice that by default Altair placed the category labels in alphabetical order on the x-axis. However, this visualization of categories is not very informative, as the heights of the bars are all over the place. 

We can actually sort the counts right inside the visualization. To do this, we can set the `sort` parameter in either the X or Y axis to the other axis (i.e., the axis that we want to sort by). In this case, we want to sort the categories on the x-axis in _descending_ order by the value on the y-axis, so we set sort to `-y`.

In [21]:
alt.Chart(category_counts).mark_bar().encode(
    x = alt.X("cat", axis = alt.Axis(title = "Category"), sort='-y'),
    y = alt.Y("count", axis = alt.Axis(title = "Count"))
)

If we now put the quantitative value on the x-axis, we get a [horizontal bar chart](https://altair-viz.github.io/gallery/bar_chart_horizontal.html), which makes the category labels easier to read.

In [22]:
alt.Chart(category_counts).mark_bar().encode(
    y = alt.Y("cat", axis = alt.Axis(title = "Category")),
    x = alt.X("count", axis = alt.Axis(title = "Count"))
)

Now, add the `sort` parameter to make the "music" category be at the top of the chart with the other categories sorted accordingly.

In [23]:
alt.Chart(category_counts).mark_bar().encode(
    y = alt.Y("cat", axis = alt.Axis(title = "Category"), sort='-x'),
    x = alt.X("count", axis = alt.Axis(title = "Count"))
)

Lastly, use the lecture notes to adjust the font of the labels and axes, and to add a meaningful title to the chart.

*Hint: you'll need to modify options in the `configure_axis`, `configure_title`, and `properties`.*

<!--
BEGIN QUESTION
name: q3viz
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [24]:
# Sorted horizontal bar chart with a title and larger fonts
alt.Chart(category_counts).mark_bar().encode(
    y = alt.Y("cat", axis = alt.Axis(title = "Category"), sort='-x'),
    x = alt.X("count", axis = alt.Axis(title = "Count"))
).properties(
    title="Kickstarter Project Category Frequency"
).configure_title(fontSize=18).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
)

### Question 4

What does this graph tell you about the projects that are asking for funding on Kickstarter? 

At a minimum, comment on the categories and their frequencies. Include any additional observation and analysis.
Answer in the markdown cell below.

<!--
BEGIN QUESTION
name: q4
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

Mostly music and film & video projects; the latter could be tied to the former as well. Tech is 3rd, which is surprising given that Kickstarter is usually associated with tech. Journalism and dance come last.

### Question 5a

Now let's take a look at the main categories for projects that were successfully funded. Filter our clean dataset by rows where `status == "successful"`. 

<!--
BEGIN QUESTION
name: q5a
manual: false
points: 5
gradescope: show
-->

In [25]:
df_success = df_clean[df_clean['status'] == "successful"]
df_success.head(20)

Unnamed: 0,currency,main_category,sub_category,duration,goal_usd,country,blurb_length,name_length,status
0,USD,games,Tabletop Games,16.0,2000.0,US,14,7,successful
1,GBP,comics,Comic Books,30.0,3870.99771,GB,24,8,successful
2,USD,fashion,Apparel,30.0,1100.0,US,21,7,successful
3,USD,music,Country & Folk,45.0,3500.0,US,15,6,successful
4,USD,technology,Gadgets,60.0,30000.0,US,15,4,successful
5,USD,music,Country & Folk,30.0,7500.0,US,11,4,successful
6,USD,film & video,Drama,30.0,6000.0,US,3,2,successful
7,EUR,design,Graphic Design,30.0,1133.68788,DE,17,8,successful
8,USD,art,Sculpture,30.0,3000.0,US,19,3,successful
9,USD,music,Country & Folk,60.0,1000.0,US,27,10,successful


In [26]:
grader.check("q5a")

### Question 5b

Once you have filtered the dataset, follow the example to create a bar chart of the counts of categories of successful projects.

<!--
BEGIN QUESTION
name: q5b
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [27]:
success_counts = df_success["main_category"].value_counts()

category_counts_2 = pd.DataFrame({
    "cat": success_counts.axes[0],
    "count": success_counts.values
})

alt.Chart(category_counts_2).mark_bar().encode(
    y = alt.Y("cat", axis = alt.Axis(title = "Category"), sort='-x'),
    x = alt.X("count", axis = alt.Axis(title = "Count"))
).properties(
    title="Successful Kickstarter Project Category Frequency"
).configure_title(fontSize=18).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
)

### Question 5c

What do you notice is different about this plot? Why do you see these differences? Provide your analysis on the differences in the markdown cell below.

<!--
BEGIN QUESTION
name: q5c
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

*Answer Here*

The success rate of technology projects appears to be low, since technology ranks lower among successful projects than among all projects. Games retain a high success rate.
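Comparing ranks between two charts only hints at success rates. A more direct check is to compute the fraction of successful projects per category. The sketch below does this on a small made-up dataframe (`toy` is hypothetical), not on `df_clean`:

```python
import pandas as pd

# Hypothetical toy data, not from the Kickstarter dataset
toy = pd.DataFrame({
    "main_category": ["technology", "technology", "technology", "technology",
                      "games", "games", "games", "music"],
    "status": ["failed", "failed", "failed", "successful",
               "successful", "successful", "failed", "successful"],
})

# The mean of a boolean column is the fraction of True values
is_success = toy["status"] == "successful"
success_rate = (
    is_success.groupby(toy["main_category"]).mean().sort_values(ascending=False)
)
print(success_rate)
```

The resulting series can be converted to a dataframe and plotted as a bar chart in the same way as the category counts.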

### Question 6

As your final task for this lab, create your own analysis on another aspect of `df_clean`. Here are a couple ideas to get you started. Please create as many cells as needed below this one for your plots and your code. At the end, remember to replace the last markdown cell with your analysis. Points will be awarded for the _clarity_ of your analysis, not on the complexity of it. 

Some example questions that you might ask:

* What do the spreads of the funding goals look like for different categories and why?
* We took a look at the main categories, but what about sub categories? Does this have a role to play in the more volatile main categories?
* What do the word counts of the blurbs and the names of each project have to do with their success?

All the columns in our clean dataframe are listed below:
* main_category - Category that the project is listed under
* sub_category - Category within the main category that the project is listed under
* duration - How many days the Kickstarter is open for funding
* goal_usd - Minimum goal in USD for the project to succeed
* country - Country that the Kickstarter is listed in
* blurb_length - Word count of the blurb that explains the Kickstarter project
* name_length - Word count of the name of the Kickstarter project
* status - Whether the Kickstarter was "successful" or "failed" 
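As a starting point for the first example question, per-category spreads can be summarized with `groupby`. This is a minimal sketch on made-up numbers (`toy` and `goal_spread` are hypothetical names); the median is used because it is less sensitive to a few very large goals than the mean:

```python
import pandas as pd

# Hypothetical toy data, not from the Kickstarter dataset
toy = pd.DataFrame({
    "main_category": ["music", "music", "music",
                      "technology", "technology", "technology"],
    "goal_usd": [1000.0, 2000.0, 3000.0, 5000.0, 20000.0, 500000.0],
})

# One row per category with a few summary statistics of the funding goal
goal_spread = toy.groupby("main_category")["goal_usd"].agg(["median", "min", "max"])
goal_spread["range"] = goal_spread["max"] - goal_spread["min"]
print(goal_spread)
```

Calling `.reset_index()` on the result gives a regular dataframe that can be passed to `alt.Chart`.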

Before you submit your lab, go back to Question 0 and add any additional insights about ethical considerations and principles of measurement that you might have found.

<!--
BEGIN QUESTION
name: q6
manual: true
points: 10
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [28]:
df_clean.head(5)

Unnamed: 0,currency,main_category,sub_category,duration,goal_usd,country,blurb_length,name_length,status
0,USD,games,Tabletop Games,16.0,2000.0,US,14,7,successful
1,GBP,comics,Comic Books,30.0,3870.99771,GB,24,8,successful
2,USD,fashion,Apparel,30.0,1100.0,US,21,7,successful
3,USD,music,Country & Folk,45.0,3500.0,US,15,6,successful
4,USD,technology,Gadgets,60.0,30000.0,US,15,4,successful


In [29]:
df_clean["main_category"]

0              games
1             comics
2            fashion
3              music
4         technology
             ...    
192543          food
192544    technology
192545          food
192546           art
192547        crafts
Name: main_category, Length: 192527, dtype: category
Categories (15, object): [art, comics, crafts, dance, ..., photography, publishing, technology, theater]

In [30]:
main_category_counts = df_clean["main_category"].value_counts()
main_category_counts

music           25844
film & video    25619
technology      19590
art             19042
publishing      18676
food            14961
games           12563
fashion         10811
comics           8166
design           7717
photography      7346
crafts           6530
theater          6451
journalism       5287
dance            3924
Name: main_category, dtype: int64

In [31]:
money_total = [int(df_clean['goal_usd'][df_clean['main_category'] == category].sum())
              for category in main_category_counts.axes[0]]

category_money = pd.DataFrame({
    "cat": list(main_category_counts.axes[0]),
    "money": money_total
})

alt.Chart(category_money).mark_bar().encode(
    y = alt.Y("cat:O", axis = alt.Axis(title = "Category"), sort='-x'),
    x = alt.X("money:Q", axis = alt.Axis(title = "Money Goal (USD)"))
).properties(
    title="Kickstarter Project Total Funding Goal per Category"
).configure_title(fontSize=18).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
)

Film and video projects ask for nearly as much money as all other categories combined.
This is surprising to me. I thought technology would be number one by a long shot.

# Running Built-in Tests
1. All tests are in `tests` directory
1. Each python file in `tests` is a test
1. `grader.check('testname')` runs test `'testname'`, e.g. `'q1'`
1. `grader.check_all()` runs all visible tests

In [32]:
# Run built-in checks
grader.check_all()

In [None]:
# Generate pdf in classic notebook (does not work in JupyterLab)
import nb2pdf
nb2pdf.convert('lab04.ipynb')

# To generate pdf using command-line, run in terminal,
# nb2pdf lab04.ipynb

<IPython.core.display.Javascript object>

# Submission Checklist
1. Check filename is 'lab04.ipynb'
1. Save file to confirm all changes are on disk
1. Run *Kernel > Restart & Run All* to execute all code from top to bottom
1. Check `grader.check_all()` output
1. Save file again to write any new output to disk
1. Check generated pdf that all responses are displayed correctly
1. Submit to Gradescope