In [2]:
import bz2
import pickle
import pandas as pd
import numpy as np

In [7]:
sales = pd.DataFrame()

In [8]:
archived = bz2.BZ2File("walmart_sales.pkl.bz2", "r")

In [9]:
sales = pickle.load(archived)
archived.close()  # release the file handle once the DataFrame is loaded

In [10]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 413119 entries, 0 to 413118
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   store                 413119 non-null  int64         
 1   type                  413119 non-null  object        
 2   department            413119 non-null  int32         
 3   date                  413119 non-null  datetime64[ns]
 4   weekly_sales          413119 non-null  float64       
 5   is_holiday            413119 non-null  bool          
 6   temperature_c         413119 non-null  float64       
 7   fuel_price_usd_per_l  413119 non-null  float64       
 8   unemployment          413119 non-null  float64       
dtypes: bool(1), datetime64[ns](1), float64(4), int32(1), int64(1), object(1)
memory usage: 27.2+ MB


# A bit of theory

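``` .set_index("column_name") ``` makes the given column the index

```.reset_index()``` restores the default integer index; the old index becomes a regular column again

```.reset_index(drop=True)``` restores the default integer index and discards the old index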
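
The three calls above can be tried on a small example. This is only a sketch: the `dogs` DataFrame is not defined anywhere in these notes, so the rows and the `height_cm` column below are made-up placeholders.

```python
import pandas as pd

# Hypothetical toy data standing in for the `dogs` DataFrame used in these notes
dogs = pd.DataFrame({
    "name": ["One", "Two", "Three"],
    "breed": ["Labrador", "Chihuahua", "Labrador"],
    "color": ["Brown", "Tan", "Black"],
    "height_cm": [56, 18, 59],
})

dogs_ind = dogs.set_index("name")          # "name" moves from the columns into the index
restored = dogs_ind.reset_index()          # "name" comes back as a regular column
dropped = dogs_ind.reset_index(drop=True)  # "name" is discarded, default 0..n-1 index

print(dogs_ind)
print(restored)
print(dropped)
```

Note that none of these calls modifies `dogs` in place; each one returns a new DataFrame.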

### Indexes make subsetting simpler
#### Without an index (more verbose)

``` dogs[dogs["name"].isin(["One","Two"])] ```  
keeps the rows whose name is in the list passed to ```isin```

#### Same with an index

``` dogs_ind.loc[["One","Two"]] ```
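
A minimal side-by-side sketch of the two approaches, assuming a made-up `dogs` DataFrame (the notes never define one, so the rows here are placeholders):

```python
import pandas as pd

# Hypothetical toy data; only the "name" column matters for this comparison
dogs = pd.DataFrame({
    "name": ["One", "Two", "Three"],
    "breed": ["Labrador", "Chihuahua", "Labrador"],
})
dogs_ind = dogs.set_index("name")

# Column-based filter: build a boolean mask with .isin(), then subset
by_column = dogs[dogs["name"].isin(["One", "Two"])]

# Index-based filter: pass the wanted index labels straight to .loc[]
by_index = dogs_ind.loc[["One", "Two"]]

print(by_column)
print(by_index)
```

Both select the same rows; the difference is that `by_index` carries the names in its index rather than in a column.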

## Multilevel indexes

``` dogs_ind3 = dogs.set_index(["breed", "color"]) ```

### Subsetting multilevel indexes

``` dogs_ind3.loc[["Labrador", "Chihuahua"]] ```

### Subsetting inner level

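Pass a list of tuples, one tuple per row to select. Each tuple has the form (outer, inner):
1. The first tuple selects the outer/inner pair "Labrador"/"Brown"
2. The second tuple selects the outer/inner pair "Chihuahua"/"Tan"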

``` dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]] ```
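
The outer-level and inner-level lookups can be compared on a small example. This is a sketch with made-up rows, since the `dogs` DataFrame is not defined in these notes:

```python
import pandas as pd

# Hypothetical toy data standing in for `dogs`
dogs = pd.DataFrame({
    "name": ["One", "Two", "Three"],
    "breed": ["Labrador", "Chihuahua", "Labrador"],
    "color": ["Brown", "Tan", "Black"],
})
dogs_ind3 = dogs.set_index(["breed", "color"])

# A list of plain labels matches on the outer level only ("breed")
outer = dogs_ind3.loc[["Labrador", "Chihuahua"]]

# A list of (outer, inner) tuples matches on both levels
inner = dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

print(outer)  # all Labradors and Chihuahuas, whatever their color
print(inner)  # only the brown Labrador and the tan Chihuahua
```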

## Sorting indexes

``` somedf.sort_index()  ``` sorts the rows by their index values

By default:
- levels are sorted from outer to inner
- in ascending order

### Tuning the sorting

``` somedf.sort_index(level=["outer", "inner"], ascending=[True, False]) ```
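
A short sketch of the default and the tuned sort. The `dogs` rows are made up, and the level names "breed" and "color" stand in for the generic "outer" and "inner" used above:

```python
import pandas as pd

# Hypothetical toy data with a two-level index
dogs = pd.DataFrame({
    "name": ["One", "Two", "Three"],
    "breed": ["Labrador", "Chihuahua", "Labrador"],
    "color": ["Brown", "Tan", "Black"],
})
dogs_ind3 = dogs.set_index(["breed", "color"])

# Default: every level ascending, outer level first
default_sorted = dogs_ind3.sort_index()

# Tuned: breed ascending, color descending within each breed
custom_sorted = dogs_ind3.sort_index(level=["breed", "color"], ascending=[True, False])

print(default_sorted)
print(custom_sorted)
```

With this data the two results differ only in the order of the two Labrador rows.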


## Setting & removing indexes
pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world. pandas is loaded as pd.
- Look at temperatures.
- Set the index of temperatures to "city", assigning to temperatures_ind.
- Look at temperatures_ind. How is it different from temperatures?
- Reset the index of temperatures_ind, keeping its contents.
- Reset the index of temperatures_ind, dropping its contents.

In [None]:
# Look at temperatures
print(temperatures)

# Index temperatures by city
temperatures_ind = temperatures.set_index("city")

# Look at temperatures_ind
print(temperatures_ind)

# Reset the index, keeping its contents
print(temperatures_ind.reset_index())

# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))

### Subsetting with .loc[]
The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.

- Create a list of cities to subset on: Moscow and Saint Petersburg. Assign to cities.
- Use [] subsetting to filter temperatures for rows where the city column takes a value in cities.
- Use .loc[] subsetting to filter temperatures_ind for rows where the city is in cities.

In [None]:
# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

## Setting multi-level indexes
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial you might have control and treatment groups. Each test subject belongs to one group or the other, so we can say that a test subject is nested inside its group. Similarly, in the temperature dataset, each city is located in a country, so we can say a city is nested inside its country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes, and keep track of how your data is represented.

- Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.
- Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.
- Print and subset temperatures_ind for rows_to_keep using .loc[].

In [None]:
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country", "city"])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])

## Sorting by index values
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().

- Sort temperatures_ind by the index values.
- Sort temperatures_ind by the index values at the "city" level.
- Sort temperatures_ind by ascending country then descending city.

In [None]:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city"))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False]))

## Slicing index values
Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values, or by row/column number; we'll start with the first case. This involves slicing inside the .loc[] method.

Compared to slicing lists, there are a few things to remember.

1. You can only slice an index if the index is sorted (using .sort_index()).
2. To slice at the outer level, first and last can be strings.
3. To slice at inner levels, first and last should be tuples.
4. If you pass a single slice to .loc[], it will slice the rows.

- Sort the index of temperatures_ind.
- Use slicing with .loc[] to get these subsets:
  - from Pakistan to Russia.
  - from Lahore to Moscow. (This will return nonsense.)
  - from Pakistan, Lahore to Russia, Moscow.
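
One possible way to approach these steps is sketched below. The real `temperatures` DataFrame is not included in this notebook, so the example builds a tiny stand-in: the `avg_temp_c` column name and its values are placeholders, not real measurements, and `temperatures_srt` is a name introduced here for the sorted result.

```python
import pandas as pd

# Hypothetical stand-in for `temperatures`; values are placeholders
temperatures = pd.DataFrame({
    "country": ["Russia", "Pakistan", "Brazil", "Russia", "Pakistan", "Mexico"],
    "city": ["Moscow", "Lahore", "Rio De Janeiro", "Saint Petersburg", "Faisalabad", "Mexico City"],
    "avg_temp_c": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
temperatures_ind = temperatures.set_index(["country", "city"])

# Slicing by label needs a sorted index
temperatures_srt = temperatures_ind.sort_index()

# Outer level: plain strings slice on country (both endpoints included)
outer_slice = temperatures_srt.loc["Pakistan":"Russia"]

# Plain strings are compared against the outer level, so city names are
# treated as if they were countries: this keeps countries that sort
# between "Lahore" and "Moscow", which is not what was intended
nonsense = temperatures_srt.loc["Lahore":"Moscow"]

# Inner level: use (country, city) tuples for both endpoints
inner_slice = temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

print(outer_slice)
print(nonsense)
print(inner_slice)
```

In this toy data the "nonsense" slice returns only the Mexico row, because "Mexico" happens to fall alphabetically between "Lahore" and "Moscow".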