# Data Manipulation with pandas

👋 Welcome to your new **workspace**! Here, you can experiment with the data you used in [Data Manipulation with pandas](https://app.datacamp.com/learn/courses/data-manipulation-with-pandas) and practice your newly learned skills with some challenges. You can find out more about DataCamp Workspace [here](https://workspace-docs.datacamp.com/).

On average, we expect users to take approximately **30 minutes** to complete the content in this workspace. However, you are free to experiment and practice in it as long as you would like!

## 1. Get Started
Below is a code cell. It is used to execute Python code. The code below imports three packages you used in Data Manipulation with pandas: pandas, NumPy, and Matplotlib. The code also imports data you used in the course as DataFrames using the pandas [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

🏃**To execute the code, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.**

In [None]:
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Import the four files
avocados = pd.read_csv('files/avocado.csv')
homelessness = pd.read_csv('files/homelessness.csv')
temperatures = pd.read_csv('files/temperatures.csv')
walmart = pd.read_csv('files/walmart.csv')

## 2. Write Code
After running the cell above, you have created four pandas DataFrames: `avocados`, `homelessness`, `temperatures`, and `walmart`.

**Add code** to the code cells below to try one (or more) of the following challenges:

1. Print the highest weekly sales for each `department` in the `walmart` DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/aggregating-dataframes?ex=1).
2. What was the total `nb_sold` of organic avocados in 2017 in the `avocados` DataFrame? If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/slicing-and-indexing-dataframes?ex=6).
3. Create a bar plot of the total number of homeless people by region in the `homelessness` DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/creating-and-visualizing-dataframes?ex=1).
4. Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/creating-and-visualizing-dataframes?ex=1).

Be sure to check out the **Answer Key** at the end to see one way to solve each problem. Did you try something similar?

**Reminder: To execute the code you add to a cell, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.**

# MODULE (1) Transforming DataFrames

## Inspecting a DataFrame
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

**.head()** returns the first few rows (the “head” of the DataFrame). <br>
**.info()** shows information on each of the columns, such as the data type and number of missing values. <br>
**.shape** returns the number of rows and columns of the DataFrame. <br>
**.describe()** calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals not part of a family with children. The family_members column is the number of homeless individuals part of a family with children. The state_pop column is the state's total population.
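As a minimal sketch of these four tools, the example below builds a small made-up `dogs` DataFrame (the names and values are hypothetical, not course data) so it can run without the CSV files.

```python
import pandas as pd

# Hypothetical toy data; one weight is deliberately missing
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua"],
    "height_cm": [56, 43, 46, 49, 59, 18],
    "weight_kg": [25.0, 23.0, 22.0, None, 29.0, 2.0],
})

first_rows = dogs.head(3)    # first three rows instead of the default five
n_rows, n_cols = dogs.shape  # .shape is an attribute, so no parentheses
summary = dogs.describe()    # summary statistics for the numeric columns

print(first_rows)
dogs.info()                  # prints dtypes and non-null counts itself
print(n_rows, n_cols)
print(summary)
```

Note that the `count` row of `.describe()` skips missing values, which is a quick way to spot incomplete columns.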

In [None]:
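# Only a cell's last expression is displayed, so print each result explicitly
print(homelessness.head())
homelessness.info()  # .info() prints its own output
print(homelessness.shape)
print(homelessness.describe())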

## Parts of a DataFrame
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

**.values:** A two-dimensional NumPy array of values. <br>
**.columns:** An index of columns: the column names. <br>
**.index:** An index for the rows: either row numbers or row names. <br>
You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)
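The three components can be inspected directly. The sketch below uses a tiny hypothetical DataFrame with row names so that the index is easy to tell apart from the columns.

```python
import pandas as pd

# Hypothetical two-row DataFrame with dog names as the row index
dogs = pd.DataFrame(
    {"breed": ["Labrador", "Poodle"], "height_cm": [56, 43]},
    index=["Bella", "Charlie"],
)

values = dogs.values  # two-dimensional NumPy array of the cell values
print(values)
print(dogs.columns)   # Index holding the column names
print(dogs.index)     # Index holding the row names
```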

### Sorting rows
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to **.sort_values()**.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

| Sort on … | Syntax |
| --- | --- |
| one column | `df.sort_values("breed")` |
| multiple columns | `df.sort_values(["breed", "weight_kg"])` |

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".
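A short sketch of both patterns, using a made-up `dogs` DataFrame (hypothetical values): sorting on two columns with a different direction for each, and combining a sort with `.head()` to get the "top" rows.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Labrador"],
    "weight_kg": [25, 23, 22, 17, 29],
})

# Breed ascending, ties broken by weight descending
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

# "What are the two heaviest dogs?"
heaviest_two = dogs.sort_values("weight_kg", ascending=False).head(2)

print(by_breed_weight)
print(heaviest_two)
```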

>**Exercise** <br>
>1. Create a DataFrame called individuals that contains only the individuals column of homelessness.
>Print the head of the result.
>homelessness is available and pandas is loaded as pd.
>2. Sort homelessness by the number of homeless individuals, from smallest to largest, and save this as homelessness_ind.
>Print the head of the sorted DataFrame.
>3. Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.
>Print the head of the sorted DataFrame.
>4. Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.
>Print the head of the sorted DataFrame.

In [None]:
# No 2: Sort homelessness by individuals
homelessness_ind = homelessness.sort_values("individuals")

# Print the top few rows
homelessness_ind.head()

In [None]:
# No 3: Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending=False)

# Print the top few rows
homelessness_fam.head()

In [None]:
# No 4: Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])

# Print the top few rows
homelessness_reg_fam.head()

### Subsetting columns
When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. To select only "col_a" of the DataFrame df, use

`df["col_a"]`<br>
To select "col_a" and "col_b" of df, use

`df[["col_a", "col_b"]]`

homelessness is available and pandas is loaded as pd.
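One detail worth seeing in code is the difference between single and double square brackets. The sketch below uses a hypothetical three-column `df`; single brackets give back a Series, while a list of names gives back a DataFrame in the order you list them.

```python
import pandas as pd

# Hypothetical DataFrame with placeholder column names
df = pd.DataFrame({"col_a": [1, 2, 3], "col_b": [4, 5, 6], "col_c": [7, 8, 9]})

one_col = df["col_a"]               # single brackets: a Series
one_col_df = df[["col_a"]]          # list with one name: a one-column DataFrame
reordered = df[["col_c", "col_a"]]  # columns come back in the order listed

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(reordered)
```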

**Exercise**<br>
1. Create a DataFrame called individuals that contains only the individuals column of homelessness.
Print the head of the result.
2. Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
Print the head of the result.
3. Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.
Print the head of the result.

In [None]:
# Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result
individuals.head()

In [None]:
# Select the state and family_members columns
state_fam = homelessness[["state","family_members"]]

# Print the head of the result
state_fam.head()


In [None]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals","state"]]

# Print the head of the result
print(ind_state.head())

### Subsetting rows
A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return True or False for each row, then pass the result inside square brackets.

`dogs[dogs["height_cm"] > 60]`

`dogs[dogs["color"] == "tan"]`

You can filter for multiple conditions at once by using the "bitwise and" operator, &. Wrap each condition in parentheses, because & binds more tightly than the comparison operators.

`dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]`

homelessness is available and pandas is loaded as pd.
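It can help to name the Boolean conditions before combining them. The sketch below does this on a made-up `dogs` DataFrame (hypothetical values) and shows both the "and" (`&`) and "or" (`|`) combinations.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "height_cm": [56, 43, 46, 49, 59],
    "color": ["brown", "black", "tan", "gray", "tan"],
})

is_tall = dogs["height_cm"] > 50  # Boolean Series, one value per row
is_tan = dogs["color"] == "tan"

tall_and_tan = dogs[is_tall & is_tan]  # rows where both conditions hold
tall_or_tan = dogs[is_tall | is_tan]   # rows where at least one holds

print(tall_and_tan)
print(tall_or_tan)
```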

1. Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.

2. Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.

3. Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.

In [None]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"]>10000]

# See the result
print(ind_gt_10k)

In [None]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness["region"] == "Mountain"]

# See the result
print(mountain_reg)

In [None]:
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"]<1000) & (homelessness["region"] == "Pacific")]

# See the result
print(fam_lt_1k_pac)

### Subsetting rows by categorical variables
Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

`colors = ["brown", "black", "tan"]`<br>
`condition = dogs["color"].isin(colors)`<br>
`dogs[condition]`

homelessness is available and pandas is loaded as pd.
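To make the equivalence concrete, the sketch below (hypothetical `dogs` data) checks that `.isin()` selects the same rows as chained `|` conditions, and uses `~` to invert the condition.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "color": ["brown", "black", "tan", "gray", "tan"],
})

colors = ["brown", "tan"]
condition = dogs["color"].isin(colors)

in_colors = dogs[condition]      # rows whose color is in the list
other_colors = dogs[~condition]  # ~ negates the Boolean Series

# The same selection written with the "or" operator
with_or = dogs[(dogs["color"] == "brown") | (dogs["color"] == "tan")]

print(in_colors)
print(other_colors)
```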

In [None]:
'''
Filter homelessness for cases where the USA census region is "South Atlantic" 
or it is "Mid-Atlantic", assigning to south_mid_atlantic. View the printed result.
'''
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness['region'].isin(["South Atlantic", "Mid-Atlantic"])]

# See the result
south_mid_atlantic

In [None]:
'''
Filter homelessness for cases where the USA census state is in the list of Mojave states, canu, 
assigning to mojave_homelessness. 
View the printed result.'''
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
print(mojave_homelessness)

## - NEW COLUMNS

# MODULE (3): Slicing and Indexing DataFrames

## **Explicit Index**

### **Setting and removing indexes**


***Exercise***<br>
pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world. pandas is loaded as pd.

***Instructions***

Look at temperatures.

1. Set the index of temperatures to "city", assigning to temperatures_ind.
2. Look at temperatures_ind. How is it different from temperatures?
3. Reset the index of temperatures_ind, keeping its contents.
4. Reset the index of temperatures_ind, dropping its contents.
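The steps above can be sketched on a small stand-in for the temperatures data. The `temps` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the temperatures data
temps = pd.DataFrame({
    "city": ["Lahore", "Moscow", "Toronto"],
    "country": ["Pakistan", "Russia", "Canada"],
    "avg_temp_c": [24.1, 5.8, 9.4],
})

temps_ind = temps.set_index("city")         # city moves from a column to the index
restored = temps_ind.reset_index()          # city comes back as a regular column
dropped = temps_ind.reset_index(drop=True)  # city is discarded entirely

print(temps_ind)
print(restored)
print(dropped)
```

Neither `.set_index()` nor `.reset_index()` modifies the original DataFrame here; each returns a new one.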


In [None]:
### Import Libraries 
import pandas as pd
from matplotlib import pyplot as plt

In [None]:
#Always having an idea of your dataset
temperatures.head()

In [None]:
# Look at the city column of temperatures
print(temperatures['city'])
# Look at the unique values in the city column
# print(temperatures['city'].unique())

# Index temperatures by city
temperatures_ind = temperatures.set_index('city')

# Look at temperatures_ind
print(temperatures_ind.head())

# Reset the index, keeping its contents
print(temperatures_ind.reset_index().head())

# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))

### **Subsetting with .loc[]**

***Exercise***<br>
The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.

pandas is loaded as pd. temperatures and temperatures_ind are available; the latter is indexed by city.
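A small side-by-side sketch, using a made-up `temps` DataFrame, shows the two styles. One difference to be aware of: `.isin()` keeps rows in their original order, while `.loc[]` with a list returns rows grouped in the order of the list.

```python
import pandas as pd

# Hypothetical stand-in for the temperatures data
temps = pd.DataFrame({
    "city": ["Moscow", "Lahore", "Saint Petersburg", "Moscow"],
    "avg_temp_c": [-7.0, 12.5, -6.0, -5.5],
})
temps_ind = temps.set_index("city")

cities = ["Moscow", "Saint Petersburg"]

bracket_subset = temps[temps["city"].isin(cities)]  # Boolean filter on a column
loc_subset = temps_ind.loc[cities]                  # lookup by index label

print(bracket_subset)
print(loc_subset)
```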

***Instructions***
1. Create a list called cities that contains "Moscow" and "Saint Petersburg".
2. Use [] subsetting to filter temperatures for rows where the city column takes a value in the cities list.
3. Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list.

In [None]:
# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets
temperatures[temperatures["city"].isin(cities)].head()



In [None]:
# Subset temperatures_ind using .loc[]
temperatures_ind.loc[cities].head()

### **Setting multi-level indexes**

***Exercise***<br>
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial, you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside the treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside the country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes and keep track of how your data is represented.

pandas is loaded as pd. temperatures is available.
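As a rough illustration of nesting, the sketch below builds a two-level index on made-up data. Selecting on the inner level requires a list of tuples, one `(country, city)` pair per row you want, whereas a list of plain strings selects on the outer level.

```python
import pandas as pd

# Hypothetical stand-in for the temperatures data
temps = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan", "Pakistan"],
    "city": ["Rio De Janeiro", "Salvador", "Karachi", "Lahore"],
    "avg_temp_c": [25.6, 26.1, 27.3, 24.1],
})
temps_ind = temps.set_index(["country", "city"])

# Inner-level selection: one tuple per (country, city) pair
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]
pairs = temps_ind.loc[rows_to_keep]

# Outer-level selection: every city in Brazil
brazil = temps_ind.loc[["Brazil"]]

print(pairs)
print(brazil)
```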

***Instructions***<br>
1. Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.
2. Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.
3. Print and subset temperatures_ind for rows_to_keep using .loc[].

In [None]:
"""1) Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind"""
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country", "city"])

"""2) Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep."""
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

"""3). Print and subset temperatures_ind for rows_to_keep using .loc[]."""
# Subset for rows to keep
(temperatures_ind.loc[rows_to_keep]).head()

### **Sorting by index values**

***Exercise*** <br>
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().

pandas is loaded as pd. temperatures_ind has a multi-level index of country and city, and is available.
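The three sorting variants can be compared on a small, deliberately unsorted multi-level index. The data below is hypothetical; the `level` and `ascending` arguments are the parts to focus on.

```python
import pandas as pd

# Hypothetical stand-in for the temperatures data, in no particular order
temps = pd.DataFrame({
    "country": ["Russia", "Pakistan", "Russia", "Pakistan"],
    "city": ["Moscow", "Lahore", "Saint Petersburg", "Karachi"],
    "avg_temp_c": [5.8, 24.1, 5.2, 27.3],
})
temps_ind = temps.set_index(["country", "city"])

by_all = temps_ind.sort_index()               # outer level first, then inner
by_city = temps_ind.sort_index(level="city")  # sort on the inner level
mixed = temps_ind.sort_index(level=["country", "city"], ascending=[True, False])

print(by_all)
print(by_city)
print(mixed)
```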

***Instructions*** <br>
1. Sort temperatures_ind by the index values.
2. Sort temperatures_ind by the index values at the "city" level.
3. Sort temperatures_ind by ascending country then descending city.

In [None]:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index().head())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city").head())

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False]).head())

## **Slicing and subsetting with .loc and .iloc**

### **Slicing index values**

***Exercise***<br>
Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values or by row/column number; we'll start with the first case. This involves slicing inside the .loc[] method.

Compared to slicing lists, there are a few things to remember.

>You can only slice an index if the index is sorted (using .sort_index()).<br>
>To slice at the outer level, first and last can be strings.<br>
>To slice at inner levels, first and last should be tuples.<br>
>If you pass a single slice to .loc[], it will slice the rows.

pandas is loaded as pd. temperatures_ind has country and city in the index, and is available.
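The rules above can be sketched on a made-up multi-level index. Unlike list slicing, a label slice in `.loc[]` includes both the first and the last label.

```python
import pandas as pd

# Hypothetical stand-in for the temperatures data
temps = pd.DataFrame({
    "country": ["Russia", "Pakistan", "India", "Russia", "Pakistan", "Iraq", "India"],
    "city": ["Moscow", "Lahore", "Hyderabad", "Saint Petersburg", "Karachi", "Baghdad", "Delhi"],
    "avg_temp_c": [5.8, 24.1, 27.0, 5.2, 27.3, 23.0, 25.5],
})

# Sort the index first; slicing relies on a sorted index
temps_srt = temps.set_index(["country", "city"]).sort_index()

# Outer level: plain strings, both endpoints included
outer = temps_srt.loc["Pakistan":"Russia"]

# Inner level: tuples of (country, city)
inner = temps_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

print(outer)
print(inner)
```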

***Instructions*** <br>
1. Sort the index of temperatures_ind.
2. Use slicing with .loc[] to get these subsets:
   - from Pakistan to Russia.
   - from Lahore to Moscow. (This will return nonsense.)
   - from Pakistan, Lahore to Russia, Moscow.

In [None]:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia
print(temperatures_srt.loc['Pakistan':'Russia'])

# Try to subset rows from Lahore to Moscow
print(temperatures_srt.loc['Lahore':'Moscow'])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[('Pakistan','Lahore'):('Russia', 'Moscow')])

### **Slicing in both directions**

***Exercise***<br>
You've seen slicing DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.

pandas is loaded as pd. temperatures_srt is indexed by country and city, has a sorted index, and is available.

***Instructions***<br>
>1. Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
>2. Use .loc[] slicing to subset columns from date to avg_temp_c.
>3. Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.



Slice rows with code like <br>
>`df.loc[("a", "b"): ("c", "d")]`

Slice columns with code like <br>
>`df.loc[:, "e":"f"]`

Slice both ways with code like <br>
>`df.loc[("a", "b"): ("c", "d"), "e":"f"]`
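Here is a runnable sketch of the three patterns on made-up data. The `source` column is a hypothetical extra column, included only so that the column slice visibly leaves something out.

```python
import pandas as pd

# Hypothetical stand-in for the temperatures data
temps = pd.DataFrame({
    "country": ["India", "India", "Iraq", "Pakistan"],
    "city": ["Delhi", "Hyderabad", "Baghdad", "Lahore"],
    "source": ["s1", "s2", "s3", "s4"],  # hypothetical extra column
    "date": ["2000-01-01"] * 4,
    "avg_temp_c": [14.0, 23.8, 9.5, 12.8],
})
temps_srt = temps.set_index(["country", "city"]).sort_index()

rows = temps_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")]  # rows only
cols = temps_srt.loc[:, "date":"avg_temp_c"]                      # columns only
both = temps_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"]

print(rows)
print(cols)
print(both)
```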

In [None]:
temperatures_srt.head()
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq','Baghdad')])

# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, 'date':'avg_temp_c'])

# Subset in both directions at once
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq','Baghdad'),'date':'avg_temp_c'])

The most common ways to subset rows are the ways we’ve previously discussed: using a Boolean condition or by index labels. However, it is also occasionally useful to pass row numbers.

This is done using .iloc[], and like .loc[], it can take two arguments to let you subset by rows and columns.

pandas is loaded as pd. temperatures (without an index) is available.

Use .iloc[] on temperatures to take subsets.

Get the 23rd row, 2nd column (index positions 22 and 1).
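A minimal sketch of positional subsetting, using a hypothetical 30-row DataFrame built so that each value reveals its own row position. Positions start at 0, and unlike `.loc[]` label slices, `.iloc[]` slices exclude the end position.

```python
import pandas as pd

# Hypothetical data: "value" is always ten times the row position
df = pd.DataFrame({
    "row_id": range(30),
    "value": [i * 10 for i in range(30)],
    "label": [f"r{i}" for i in range(30)],
})

cell = df.iloc[22, 1]              # 23rd row, 2nd column
first_five = df.iloc[:5]           # rows at positions 0 to 4
middle_cols = df.iloc[:, 1:3]      # all rows, columns at positions 1 and 2
top_left_block = df.iloc[:5, 1:3]  # both directions at once

print(cell)
print(top_left_block)
```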

In [52]:
temperatures_ind.pivot_table()

ValueError: No group keys passed!