# 3 – Introduction to Pandas and 🚀 Project

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💭 **Reflection**: Helping you think about programming.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Importing Libraries](#lib)
2. [Importing Data from Files](#data)
3. [Working with Pandas](#pandas)
4. [🚀 Project](#project)


<a id='lib'></a>

# Importing Libraries

A **library** refers to a reusable chunk of code. Usually, a Python library contains a collection of related functionalities.

We have already been using Python's [standard library](https://docs.python.org/3/library/) - it comes ready and loaded with Python. We've also used `pandas` to work with data frames. Today, we will expand on our Pandas knowledge and do our first data science project.

Before we can use a library like Pandas, we have to **import** it into the current session.
Importing is done with the `import` keyword. We simply run `import [PACKAGE_NAME]`, and everything inside the package becomes available to use.

Let's import the `numpy` module, which has a lot of useful functions for working with numerical data. Let's access a function from this module using dot notation.

In [1]:
import numpy

print('The mean of [1,4,5] is:', numpy.mean([1,4,5]))

The mean of [1,4,5] is: 3.3333333333333335


For many packages, like `numpy`, there is an **alias**, or nickname that they are often imported as. For common packages (especially those with long names), it saves a lot of typing when you use a nickname. For example, `numpy` is usually imported as below:

In [2]:
import numpy as np

print('mean of [1,4,5] is:', np.mean([1,4,5]))

mean of [1,4,5] is: 3.3333333333333335


There are very common abbreviations used for some of the more popular libraries, including:

* `pandas` -> `pd`
* `numpy` -> `np`
* `matplotlib.pyplot` -> `plt`
* `statsmodels.api` -> `sm`

⚠️ **Warning**: Sometimes aliases can make programs harder to understand, since readers must learn your program's aliases. Be very intentional about using aliases!
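To see what an alias actually does, here is a minimal sketch using the standard library's `statistics` module. The alias `stats` is only an illustrative choice, not an established convention like `np` or `pd`.

```python
import statistics
import statistics as stats  # `stats` is an illustrative alias, not a standard one

values = [1, 4, 5]

# Both names refer to the very same module object
same_module = stats is statistics

# So the function can be reached through either name
mean_full = statistics.mean(values)
mean_alias = stats.mean(values)

print(same_module)
print(mean_full == mean_alias)
```

An alias is just a second name for the same module, so nothing is copied or changed by using it.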

### Help!

How do we know what we can do with `numpy`? Usually, packages provide **documentation** that explains their components. We can access this documentation with the `help` function:

In [3]:
help(numpy)

Help on package numpy:

NAME
    numpy

DESCRIPTION
    NumPy
    =====
    
    Provides
      1. An array object of arbitrary homogeneous items
      2. Fast mathematical operations over arrays
      3. Linear Algebra, Fourier Transforms, Random Number Generation
    
    How to use the documentation
    ----------------------------
    Documentation is available in two forms: docstrings provided
    with the code, and a loose standing reference guide, available from
    `the NumPy homepage <https://www.scipy.org>`_.
    
    We recommend exploring the docstrings using
    `IPython <https://ipython.org>`_, an advanced Python shell with
    TAB-completion and introspection capabilities.  See below for further
    instructions.
    
    The docstring examples assume that `numpy` has been imported as `np`::
    
      >>> import numpy as np
    
    Code snippets are indicated by three greater-than signs::
    
      >>> x = 42
      >>> x = x + 1
    
    Use the built-in ``help`` func

You can also view documentation [online](https://docs.python.org/3/library/math.html). 

Being comfortable sifting through documentation is a **very** important skill!

🔔 **Question**: You are curious about what is available in the `math` module, so you run `help(math)`. However, you get an error. What went wrong?

In [5]:
# We need to import the module first!
import math
help(math)

Help on module math:

NAME
    math

MODULE REFERENCE
    https://docs.python.org/3.8/library/math
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides access to the mathematical functions
    defined by the C standard.

FUNCTIONS
    acos(x, /)
        Return the arc cosine (measured in radians) of x.
    
    acosh(x, /)
        Return the inverse hyperbolic cosine of x.
    
    asin(x, /)
        Return the arc sine (measured in radians) of x.
    
    asinh(x, /)
        Return the inverse hyperbolic sine of x.
    
    atan(x, /)
        Return the arc tangent (measured in radians) of x.
    
    atan2(y, x, /)
        Return the arc tangent (measured in radians) of y/x.
    

### Importing Specific Components of a Library

We generally want to import only what we need from a library. To do so, we use the `from` keyword. This allows us to import a specific module, function, or variable, and then refer to it directly without the library name as prefix.

Specifically, we use the syntax `from [PACKAGE_NAME] import [COMPONENT]`.

Let's do this with the `numpy` module. From the `numpy.random` module we want to import the `shuffle()` function, which will shuffle a list of items.


In [None]:
from numpy.random import shuffle
test = [1,2,3,4]
shuffle(test)
print(test)

🔔 **Question**: There is another module called `random` in the Python standard library. Knowing that, why might we not want to run `from numpy import random` and `import random` in the same notebook?
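A small sketch of the underlying issue, using two standard library modules (`math` and `cmath`) that both provide a function named `sqrt`. The helper names `first` and `second` are only there for illustration. The same thing would happen with two modules that are both called `random`.

```python
import math
import cmath

from math import sqrt
first = sqrt            # remember which function `sqrt` points to right now

from cmath import sqrt  # same name: this silently replaces the earlier import
second = sqrt

print(first is math.sqrt)    # the first import came from math
print(second is cmath.sqrt)  # the name now points to cmath's version
print(sqrt(4))               # a complex number, because cmath's sqrt is used
```

Python does not warn when a later import reuses an existing name, so the most recent import wins.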

## 🥊 Challenge: Locating the Right Library

You want to select a random value from a list of data.

1. Which module in the [standard library](https://docs.python.org/3/library/) would you most expect to help? Look at the documentation and find it.
2. Which **function** would you select from that module? 💡 **Tip**: Look at "Functions for sequences" in the documentation.
3. Import the module, and apply the function to the following list.

In [6]:
ids = [1, 2, 3, 4, 5, 6]

In [8]:
# YOUR CODE HERE
import random

random.choice(ids)


1

<a id='data'></a>

# Importing data from files

No set of basic skills is complete without learning how to import data from files. 

## Getting your bearings

Before we can get our data, we first have to figure out where the file is on our hard disk! 

We can use `!pwd` to check the location of your "working directory" (the folder on your computer that Python is currently connected to). 

In [9]:
# print working directory
!pwd

/Users/tomvannuenen/Documents/GitHub/DEV/Python-Fundamentals-Revamp/solutions/Fundamentals-I


⚠️ **Warning**: Navigating file paths can be *pretty confusing* 😵‍💫, but it's an important skill!


## Import a .csv file

As data scientists, we'll mostly be working with **Comma Separated Values (.csv)** files.

Comma separated values files are common because they are relatively small and look good in spreadsheet software. A comma separated values file is just a text file that contains data but that has commas (or other separators) to indicate column breaks.

`pandas` comes with a function [`.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
that makes it really easy to import .csv files.
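As a rough sketch of what "just a text file with commas" means, the cell below builds a tiny CSV as a string and reads it with `read_csv()`. The three rows are copied from the gapminder preview in this notebook; the variable names `csv_text` and `toy` are illustrative, and `io.StringIO` is used only so that no file on disk is needed.

```python
import io
import pandas as pd

# A tiny CSV: the first line holds the column names, commas mark column breaks
csv_text = """country,year,lifeExp
Afghanistan,1952,28.801
Afghanistan,1957,30.332
Afghanistan,1962,31.997
"""

# io.StringIO lets read_csv treat the string as if it were a file
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)
print(list(toy.columns))
```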

💡 **Tip**: Let's have a look at a .csv file in our File Browser!

### Wait, how do I get my files?

The file we want to import is inside a folder called "data", which is inside of the main "Python-Fundamentals" folder. As you can see in the file path, this directory is two folders "up" from where we currently are.

💡 **Tip**: Let's use the File Browser to the left of our screen, as well as our Finder (Mac) / File Explorer (Windows), to orient ourselves.

Have a look at the "gapminder-FiveYearData.csv" file we are importing below.

* The `read_csv()` function takes a string as its main argument. This string consists of the file path pointing to the file.
* `../` means 'go up one level in the folder'.
* `../../` means 'go up two levels in the folder'.
* `data/` means 'go into a folder called "data"'.
* `gapminder-FiveYearData.csv` is the file name we are accessing within that "data" folder.
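The path pieces listed above can be sketched with the standard library's `posixpath` module, which handles `/`-separated paths. The working directory `/home/user/...` below is a hypothetical example, not the real location on your machine.

```python
import posixpath

# Hypothetical working directory (yours will differ; check it with !pwd)
cwd = "/home/user/Python-Fundamentals/solutions/Fundamentals-I"

# The relative path we hand to read_csv()
relative = "../../data/gapminder-FiveYearData.csv"

# Join the two, then collapse each "../" by stepping up one folder
resolved = posixpath.normpath(posixpath.join(cwd, relative))

print(resolved)
```

Each `../` removes one folder from the end of the working directory before `data/` is entered.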

In [81]:
import pandas as pd

df = pd.read_csv('../../data/gapminder-FiveYearData.csv')
df.head()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106


🔔 **Question**: What does the [gapminder-FiveYearData](https://en.wikipedia.org/wiki/Gapminder_Foundation) dataset seem to be about?

<a id='pandas'></a>

# Working with Pandas

Pandas has hundreds of useful ways for us to work with Data Frames. We will cover a couple of general topics here.

## Slicing Columns
We can choose a single column by selecting the name of that column. The act of obtaining a particular subset of a data frame is often referred to as **slicing**. This uses bracket notation to select part of the data.

Check it out:

In [11]:
df['country']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

`pandas` calls this a `Series` object. It's like a list, except it's labeled. 

You can slice a Series object just like you can with a list!

In [12]:
gap_country = df['country']
gap_country[0]

'Afghanistan'
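To illustrate what "labeled" means, here is a minimal sketch of a Series whose labels are strings rather than the default 0, 1, 2. The labels `y1952`, `y1957`, and `y1962` are made up for this example; the values come from the preview above.

```python
import pandas as pd

# A tiny Series with string labels instead of the default integer index
life_exp = pd.Series([28.801, 30.332, 31.997],
                     index=['y1952', 'y1957', 'y1962'])

print(list(life_exp.index))   # the labels
print(life_exp['y1957'])      # look up a value by its label
print(life_exp.iloc[0])       # look up a value by its position
```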

`DataFrame` objects also have methods, including those for [merging](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge), [aggregation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), [nulls](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), and others. Many methods also operate on a single column of the DataFrame. For example, we can count the unique values in a column by using `.nunique()`, and see what those unique values are by using `.unique()`:

In [18]:
#number of unique countries in the df
print(df['country'].nunique())

#unique countries in the df
print(df['country'].unique())

142
['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Argentina' 'Australia'
 'Austria' 'Bahrain' 'Bangladesh' 'Belgium' 'Benin' 'Bolivia'
 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Bulgaria' 'Burkina Faso'
 'Burundi' 'Cambodia' 'Cameroon' 'Canada' 'Central African Republic'
 'Chad' 'Chile' 'China' 'Colombia' 'Comoros' 'Congo Dem. Rep.'
 'Congo Rep.' 'Costa Rica' "Cote d'Ivoire" 'Croatia' 'Cuba'
 'Czech Republic' 'Denmark' 'Djibouti' 'Dominican Republic' 'Ecuador'
 'Egypt' 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Ethiopia' 'Finland'
 'France' 'Gabon' 'Gambia' 'Germany' 'Ghana' 'Greece' 'Guatemala' 'Guinea'
 'Guinea-Bissau' 'Haiti' 'Honduras' 'Hong Kong China' 'Hungary' 'Iceland'
 'India' 'Indonesia' 'Iran' 'Iraq' 'Ireland' 'Israel' 'Italy' 'Jamaica'
 'Japan' 'Jordan' 'Kenya' 'Korea Dem. Rep.' 'Korea Rep.' 'Kuwait'
 'Lebanon' 'Lesotho' 'Liberia' 'Libya' 'Madagascar' 'Malawi' 'Malaysia'
 'Mali' 'Mauritania' 'Mauritius' 'Mexico' 'Mongolia' 'Montenegro'
 'Morocco' 'Mozambique' 'Myanmar'

## `.head()`, `.describe()`, and `.value_counts()`

The `.head()` method will show the first five rows of a Data Frame by default. Put an integer in the parentheses to specify a different number of rows. 

`.describe()` provides basic summary statistics. 

`.value_counts()` counts frequencies.

In [14]:
# View the first 3 rows
df.head(3)

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071


In [15]:
# Produce some quick summary statistics
df.describe()

Unnamed: 0,year,pop,lifeExp,gdpPercap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,29601210.0,59.474439,7215.327081
std,17.26533,106157900.0,12.917107,9857.454543
min,1952.0,60011.0,23.599,241.165876
25%,1965.75,2793664.0,48.198,1202.060309
50%,1979.5,7023596.0,60.7125,3531.846988
75%,1993.25,19585220.0,70.8455,9325.462346
max,2007.0,1318683000.0,82.603,113523.1329


Now, let's count how many rows fall into each category of a column:

In [19]:
# How many rows are there for each year?
df["year"].value_counts()

1952    142
1957    142
1962    142
1967    142
1972    142
1977    142
1982    142
1987    142
1992    142
1997    142
2002    142
2007    142
Name: year, dtype: int64

## Column names

You can call [attributes](https://medium.com/@shawnnkoski/pandas-attributes-867a169e6d9b) of a Pandas variable by using "dot notation" - it's like a method, but without the parentheses. 

💡 **Tip**: Attributes are **features** of data. Methods **allow you to do something** with data.

💡 **Tip**: A method is written with parentheses: e.g. `df['year'].value_counts()`. An attribute is written without parentheses: e.g. `df.columns`.
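A short sketch of the difference, using a made-up three-row DataFrame called `toy`. Python's built-in `callable()` reports whether something can be called with parentheses.

```python
import pandas as pd

# Made-up data, just for illustration
toy = pd.DataFrame({'country': ['Egypt', 'Egypt', 'Lesotho'],
                    'year': [1952, 1957, 1952]})

# An attribute: a stored feature of the data, no parentheses
column_names = list(toy.columns)

# A method: does something with the data, needs parentheses
counts = toy['country'].value_counts()

print(column_names)
print(callable(toy.columns))   # attributes like .columns cannot be called
print(callable(toy.head))      # methods like .head can
```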


In [20]:
# List the column names using the .columns *attribute*
df.columns

Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

🔔 **Question**: Here's another popular attribute: `shape`. What do you think it does?

In [21]:
df.shape

(1704, 6)

## Slicing Rows

You can slice rows of a DataFrame like you would a string or a list. If we just want three rows: 

In [22]:
df[6:9]

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
6,Afghanistan,1982,12881816.0,Asia,39.854,978.011439
7,Afghanistan,1987,13867957.0,Asia,40.822,852.395945
8,Afghanistan,1992,16317921.0,Asia,41.674,649.341395


## `.loc[]` and `.iloc[]`

`pandas` has two very popular methods to access data: `loc[]` and `iloc[]`. The difference is that one is **label-based**, and the other is **position-based**. 

What does that mean?

* `loc[]` looks for label names in your index (the leftmost, unnamed column in our Data Frame).
* `iloc[]` works much like accessing a list, where we use an integer to select a position in the list. 

🔔 **Question**: Do you see the difference in the next two cells? What do you think is going on?

In [23]:
df.loc[6:9]

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
6,Afghanistan,1982,12881816.0,Asia,39.854,978.011439
7,Afghanistan,1987,13867957.0,Asia,40.822,852.395945
8,Afghanistan,1992,16317921.0,Asia,41.674,649.341395
9,Afghanistan,1997,22227415.0,Asia,41.763,635.341351


In [24]:
df.iloc[6:9]

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
6,Afghanistan,1982,12881816.0,Asia,39.854,978.011439
7,Afghanistan,1987,13867957.0,Asia,40.822,852.395945
8,Afghanistan,1992,16317921.0,Asia,41.674,649.341395
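The difference is easier to see when the index labels are not integers. The sketch below uses a made-up DataFrame (`toy`) with letters as labels: `.loc[]` slices by label and includes the end label, while `.iloc[]` slices by position and excludes the end position, like a list.

```python
import pandas as pd

# Made-up data with letters as index labels
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50]},
                   index=['a', 'b', 'c', 'd', 'e'])

by_label = toy.loc['b':'d']     # labels 'b' through 'd', end label included
by_position = toy.iloc[1:3]     # positions 1 and 2, end position excluded

print(list(by_label.index))
print(list(by_position.index))
```

In the gapminder Data Frame the labels happen to be the integers 0, 1, 2, and so on, which is why `df.loc[6:9]` returns one more row than `df.iloc[6:9]`.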


### Non-Consecutive Rows 
We can also use `.loc[]` and `.iloc[]` to return non-consecutive rows. Pass in a list of **integers** inside the indexing brackets, which gives double square brackets.

For example, to get the 4th, 12th, and 29th rows using `iloc[]`: 

In [25]:
df.iloc[[3, 11, 28]]

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138
11,Afghanistan,2007,31889923.0,Asia,43.828,974.580338
28,Algeria,1972,14760787.0,Africa,54.518,4182.663766


<a id='iloc-select'></a>

### Rows and Columns

We can even pass in a second interior list to `iloc[]` to specify columns as well!

In [26]:
df.iloc[[3, 11, 28], [0,3]]

Unnamed: 0,country,continent
3,Afghanistan,Asia
11,Afghanistan,Asia
28,Algeria,Africa


While `.iloc[]` requires integers, regular `.loc[]` allows you to pass in column names:

In [27]:
df.loc[[3, 11, 28], ['country', 'year']]

Unnamed: 0,country,year
3,Afghanistan,1967
11,Afghanistan,2007
28,Algeria,1972


## Conditional Subsetting

What if we want a subset based on a condition? For example, what if we only want the rows where the country is equal to Egypt?

In [28]:
df['country'] == 'Egypt'

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703    False
Name: country, Length: 1704, dtype: bool

💡 **Tip**: Fancy terminology alert: the Series above is called a "Boolean mask". It's like a list of True/False labels that we can use to filter our Data Frame for a certain condition! *We'll cover this further in Python Fundamentals II.*

Here, we use `.loc[]` to subset our Data Frame *with the fancy Boolean mask 🪄 we just created*.

In [31]:
# Data frame just of data points in Egypt...
e = df.loc[df['country'] == 'Egypt']
e

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
456,Egypt,1952,22223309.0,Africa,41.893,1418.822445
457,Egypt,1957,25009741.0,Africa,44.444,1458.915272
458,Egypt,1962,28173309.0,Africa,46.992,1693.335853
459,Egypt,1967,31681188.0,Africa,49.293,1814.880728
460,Egypt,1972,34807417.0,Africa,51.137,2024.008147
461,Egypt,1977,38783863.0,Africa,53.319,2785.493582
462,Egypt,1982,45681811.0,Africa,56.006,3503.729636
463,Egypt,1987,52799062.0,Africa,59.797,3885.46071
464,Egypt,1992,59402198.0,Africa,63.674,3794.755195
465,Egypt,1997,66134291.0,Africa,67.217,4173.181797


In [30]:
# Data frame just of 2002
am = df.loc[df["year"] == 2002]
am.head()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
10,Afghanistan,2002,25268405.0,Asia,42.129,726.734055
22,Albania,2002,3508512.0,Europe,75.651,4604.211737
34,Algeria,2002,31287142.0,Africa,70.994,5288.040382
46,Angola,2002,10866106.0,Africa,41.003,2773.287312
58,Argentina,2002,38331121.0,Americas,74.34,8797.640716


In [32]:
# Data frame that includes the rows for South Africa OR Lesotho
both = df.loc[(df['country'] == 'South Africa') | 
              (df['country'] == 'Lesotho')]
both

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
876,Lesotho,1952,748747.0,Africa,42.138,298.846212
877,Lesotho,1957,813338.0,Africa,45.047,335.997115
878,Lesotho,1962,893143.0,Africa,47.747,411.800627
879,Lesotho,1967,996380.0,Africa,48.492,498.639026
880,Lesotho,1972,1116779.0,Africa,49.767,496.581592
881,Lesotho,1977,1251524.0,Africa,52.208,745.369541
882,Lesotho,1982,1411807.0,Africa,55.078,797.263107
883,Lesotho,1987,1599200.0,Africa,57.18,773.993214
884,Lesotho,1992,1803195.0,Africa,59.685,977.486273
885,Lesotho,1997,1982823.0,Africa,55.558,1186.147994


Learn more by [reading the documentation here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) - what is the difference between `&` and `|` ?
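As a minimal sketch of how the two operators combine Boolean masks, the cell below uses a made-up four-row DataFrame (`toy`). `|` keeps rows where at least one condition is true, while `&` keeps only rows where both are true.

```python
import pandas as pd

# Made-up data, just for illustration
toy = pd.DataFrame({
    'country': ['Egypt', 'Egypt', 'Lesotho', 'Lesotho'],
    'year': [1952, 1957, 1952, 1957],
})

is_egypt = toy['country'] == 'Egypt'   # Boolean mask 1
is_1952 = toy['year'] == 1952          # Boolean mask 2

either_condition = toy.loc[is_egypt | is_1952]   # Egypt OR 1952
both_conditions = toy.loc[is_egypt & is_1952]    # Egypt AND 1952

print(len(either_condition))
print(len(both_conditions))
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, as in the South Africa and Lesotho cell above.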

💡 **Tip**: You can learn more about Pandas DataFrames in D-Lab's [**Python Data Wrangling**](https://github.com/dlab-berkeley/Python-Data-Wrangling) workshop. [Register now](https://dlab.berkeley.edu/training/upcoming-workshops).


## Creating a new Column

To create a new column, put the new column name in square brackets on the left side of the assignment. The right side can be another column that we apply a calculation to:

In [33]:
df['lifeExp_rounded'] = df['lifeExp'].round()
df.head()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap,lifeExp_rounded
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314,29.0
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303,30.0
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071,32.0
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138,34.0
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106,36.0


<a id='project'></a>

# 🚀 Project

Time for your first data science project in Python!

* Time: 30 minutes.
* The helper will be in a **breakout room**: Join them if you want to work there, or ask questions about the code!
* The instructor will stay in the **main room**: Feel free to ask questions here as well.

### Data: Music reviews

Our dataset will consist of music reviews. It consists of a range of music albums and their reviews by different review magazines. 

This dataset is separated by tabs instead of commas. Tab-separated data can still be stored in a .csv file; we just need to pass `"\t"` to the `sep` parameter of `read_csv()`.
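To see why the `sep` parameter matters, here is a small sketch that reads the same tab-separated text twice, once with and once without it. The two rows are taken from the music preview in this notebook; the variable names are illustrative.

```python
import io
import pandas as pd

# Two columns separated by a tab character (\t)
tsv_text = "album\tscore\nDoris\t82.0\nGiraffe\t71.0\n"

# With sep='\t', pandas splits each line at the tab
with_tab = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# With the default comma separator, each line stays in a single column
with_default = pd.read_csv(io.StringIO(tsv_text))

print(list(with_tab.columns))
print(with_default.shape)
```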

In [44]:
import pandas as pd
music = pd.read_csv('../../data/music_reviews.csv', sep = '\t')
music.head()

Unnamed: 0,album,artist,genre,release_date,critic,score,body
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...


## 🥊 Challenge: `.describe()`

Use the `.describe()` method to find the average review score for all albums in this dataset.

In [35]:
## YOUR CODE HERE

music.describe()
# mean = 72.684223

Unnamed: 0,score
count,5001.0
mean,72.684223
std,8.714896
min,7.4
25%,68.0
50%,74.0
75%,79.0
max,100.0


Rename the "album" column name to "ALBUM" and the "body" column name to "Body".

In [45]:
## YOUR CODE HERE

music.rename(columns={'album': 'ALBUM', 'body': 'Body'}, inplace=True)
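As a side note on `inplace=True`, the sketch below shows what `.rename()` does without it, using a made-up two-row DataFrame (`toy`): the method returns a renamed copy and leaves the original untouched, so the result has to be stored in a variable.

```python
import pandas as pd

# Made-up data, just for illustration
toy = pd.DataFrame({'album': ['Doris', 'Giraffe'], 'score': [82.0, 71.0]})

# Without inplace=True, rename() returns a new DataFrame
renamed = toy.rename(columns={'album': 'ALBUM'})

print(list(toy.columns))       # the original keeps its column names
print(list(renamed.columns))   # the copy has the new name
```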

In [46]:
# See if it worked
music.head()

Unnamed: 0,ALBUM,artist,genre,release_date,critic,score,Body
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...


## 🥊 Challenge: Slice It!

Slice `music` to return just the 3rd, 4th, and 5th reviews (remember that counting starts at 0, so these are at index 2, 3, and 4).

In [52]:
## Using loc
music.loc[2:4]

Unnamed: 0,ALBUM,artist,genre,release_date,critic,score,Body
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...


In [53]:
## Using iloc
music.iloc[2:5]

Unnamed: 0,ALBUM,artist,genre,release_date,critic,score,Body
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...


In [54]:
## Using slicing
music[2:5]

Unnamed: 0,ALBUM,artist,genre,release_date,critic,score,Body
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...


## 🥊 Challenge: `iloc`

Use `.iloc` to return just the reviews at index 1, 82, 988, and 4002. Also return the last review, without looking up what the last index is!

In [57]:
## YOUR CODE HERE
music.iloc[[1,82,988,4002,-1]]


Unnamed: 0,ALBUM,artist,genre,release_date,critic,score,Body
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...
82,Neon Bible,Arcade Fire,Indie,2007-03-06 00:00:00,Prefix Magazine,87.0,Some Funeral devotees may be disappointed by t...
988,Until The Earth Begins To Part,Broken Records,Indie,2009-07-07 00:00:00,AllMusic,64.0,"As far as debut albums go, Until the Earth Beg..."
4002,Reptile,Eric Clapton,Rock,2001-03-13 00:00:00,Rolling Stone,66.0,"Over the course of fourteen tracks, Clapton bl..."
5000,And Their Refinement Of The Decline,Stars Of The Lid,Rock,2007-04-07 00:00:00,PopMatters,87.0,"Their work, especially that displayed on Refin..."


## 🥊 Challenge: `iloc` Slice and Stride

Use `.iloc` to return every 40th album between reviews 20 and 200 (including the upper bound).

In [58]:
## YOUR CODE HERE

# The stop position is excluded, so use 201 to include position 200
music.iloc[20:201:40]

Unnamed: 0,ALBUM,artist,genre,release_date,critic,score,Body
20,July Flame,Laura Veirs,Indie,2010-01-12 00:00:00,musicOMH.com,81.0,Laura Veirs makes an excellent case for hersel...
60,Angels & Devils,The Bug,Electronic,2014-08-26 00:00:00,The Quietus,83.0,An album that's otherwise remarkably deft at u...
100,Pilot Talk III,Curren$y,Rap,2015-04-04 00:00:00,bossmanrwat11,78.0,"Big Spitta fan here, but his remarkable consis..."
140,A Thousand Shark's Teeth,My Brightest Diamond,Indie,2008-06-17 00:00:00,PopMatters,72.0,"It’s a swooning, big-gestured album to get los..."
180,Television Man,Naomi Punk,Pop/Rock,2014-08-05 00:00:00,Rolling Stone,71.0,"Sure, the songs' formulaic stop-and-spurt atta..."


## 🥊 Challenge: `iloc` Rows and Columns

Use `iloc` to extract the last five rows in the music data set but only the "artist", "ALBUM", and "score" columns - in that order!

💡 **Tip**: Remember how to select columns using `iloc`? If not, go back [here](#iloc-select).

In [61]:
## YOUR CODE HERE

music.iloc[-5:,[1,0,5]]

Unnamed: 0,artist,ALBUM,score
4996,Conor Oberst And The Mystic Valley Band,Outer South,67.0
4997,David Gilmour,On An Island,67.0
4998,Gossip,Movement,81.0
4999,Dr. John,Locked Down,86.0
5000,Stars Of The Lid,And Their Refinement Of The Decline,87.0


## 🥊 Challenge: `.loc` Rows and Columns

Use `loc` to extract the first 5 rows in your DataFrame – but only the "artist", "ALBUM", and "score" columns - in that order!

In [65]:
## YOUR CODE HERE

music.loc[:4, ['artist','ALBUM','score']]

Unnamed: 0,artist,ALBUM,score
0,All Time Low,Don't Panic,74.0
1,Ryan Bingham,Fear and Saturday Night,70.0
2,Lee Ann Womack,The Way I'm Livin',84.0
3,Earl Sweatshirt,Doris,82.0
4,Echoboy,Giraffe,71.0


## 🥊 Challenge: Create New Column

Create a new column in your DataFrame named "score_int" that is a copy of "score" - but converted to an integer. 

💡 **Tip**: Look up a Pandas method to change the datatype of a column!

In [68]:
## YOUR CODE HERE

music['score_int'] = music['score'].astype(int)
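One thing to be aware of: converting floats with `.astype(int)` drops the decimal part rather than rounding. A small sketch with made-up scores, where the alternative of calling `.round()` first is shown only for comparison and is not part of the challenge solution:

```python
import pandas as pd

# Made-up scores, just for illustration
scores = pd.Series([7.4, 72.9, 100.0])

# astype(int) cuts off everything after the decimal point
truncated = scores.astype(int)

# Rounding first gives the nearest whole number instead
rounded = scores.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```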

## 🥊 Challenge: Create New Column with Calculation

Create a new column in your DataFrame named "decimal" that is a copy of "score_int" but that is divided by 100.

In [69]:
## YOUR CODE HERE

music['decimal'] = music['score_int'] / 100

## 🥊 Challenge: `.value_counts()`

Use `.value_counts()` to count the number of reviews for each critic. Return only the top 20.

💡 **Tip**: Have another look at the original DataFrame if you need to remind yourself which column can be useful here!

In [78]:
## YOUR CODE HERE

music['critic'].value_counts()[:20]

AllMusic                     282
PopMatters                   228
Pitchfork                    207
Q Magazine                   178
Uncut                        171
Mojo                         137
Drowned In Sound             132
New Musical Express (NME)    127
The A.V. Club                121
Rolling Stone                112
Under The Radar              100
Spin                          97
The Guardian                  96
musicOMH.com                  88
Entertainment Weekly          87
Slant Magazine                83
Paste Magazine                72
Alternative Press             69
Consequence of Sound          69
Prefix Magazine               68
Name: critic, dtype: int64
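Because `.value_counts()` sorts its result from most to least frequent by default, taking the first few entries gives the "top N". A minimal sketch with a made-up Series of critic names, using `.head()` as an equivalent way to take the first entries:

```python
import pandas as pd

# Made-up critic names, just for illustration
critics = pd.Series(['AllMusic', 'Uncut', 'AllMusic',
                     'Pitchfork', 'AllMusic', 'Uncut'])

counts = critics.value_counts()   # sorted from most to least frequent
top_two = counts.head(2)          # the two most frequent critics

print(list(top_two.index))
print(top_two.tolist())
```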

## 🥊 <span style="color:red">Bonus Challenge:</span> `.groupby()`

If you are done and still have time, go have a look at the [`groupby()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). Try to use it in order to answer the following question:

#### *Which genre has the highest score?*

💡 **Tip**: 
* Use the `.groupby()` method on your DataFrame to group it by `genre`. Save it in a variable.
* Subset this variable (use the square brackets!) to get the `score` column.
* Use the `.mean()` method on this subset to get the **average score** per genre.


In [80]:
# YOUR CODE HERE
genre = music.groupby('genre')
genre['score'].mean()

genre
Alternative/Indie Rock    73.928571
Country                   74.071429
Dance                     70.146341
Electronic                73.140351
Folk                      75.900000
Indie                     74.400897
Jazz                      77.631579
Pop                       64.608054
Pop/Rock                  73.033782
R&B                       72.366071
Rap                       72.173554
Rock                      70.754292
Name: score, dtype: float64
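To answer the question without scanning the output by eye, the averages can be sorted or queried directly. A sketch on a made-up five-row DataFrame (`toy`); `.sort_values()` and `.idxmax()` are standard Series methods, but using them here goes one step beyond the tip above.

```python
import pandas as pd

# Made-up data, just for illustration
toy = pd.DataFrame({'genre': ['Jazz', 'Jazz', 'Pop', 'Pop', 'Rock'],
                    'score': [80.0, 76.0, 60.0, 70.0, 71.0]})

# Average score per genre
mean_by_genre = toy.groupby('genre')['score'].mean()

# Sort from highest to lowest average
ranked = mean_by_genre.sort_values(ascending=False)

# Or ask directly for the label with the highest average
best_genre = mean_by_genre.idxmax()

print(ranked)
print(best_genre)
```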

# 🎉 Well done!

**This concludes Python Fundamentals I!**

Today's project took us through importing a data file, manipulating the data, and running some basic analysis.

If you were working on this dataset, what would you potentially do next? It could be either an analysis, a new feature to include, a visualization that might help represent the data, etc.

### 💡 Tip: More workshops!

D-Lab teaches workshops that allow you to practice more with DataFrames and visualization.

- To learn more about data wrangling, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).
- To learn more about data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).