# Introduction

This script was created as part of my data science studies, following the course **Data Manipulation with pandas**, available on DataCamp <https://learn.datacamp.com/courses/data-manipulation-with-pandas>.

# Part 1: Data Frames

- `.head()` **method**: used to quickly explore the data frame and get a sense of its content.
- `.info()` **method**: displays the column names, the data types they contain, and whether they have any missing values.
- `.shape` **attribute** gives us the number of rows and columns. Since it is an attribute, it doesn't require parentheses.
- `.describe()` **method**: gives us summary statistics for numerical columns, such as count, mean, std, etc.
- `.columns` **attribute**: column names
- `.index` **attribute**: row names
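
As a quick illustration of the methods and attributes listed above, here is a minimal sketch on a small made-up `dogs` data frame. The names and values are assumptions for illustration only and are not the course dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "height_cm": [56, 43, 46],
    "weight_kg": [25.0, 23.0, 22.0],
})

print(dogs.head(2))     # first two rows
dogs.info()             # column names, dtypes, non-null counts
print(dogs.shape)       # attribute: no parentheses
print(dogs.describe())  # summary statistics for the numeric columns only
print(dogs.columns)     # column labels
print(dogs.index)       # row labels
```

Note that `.describe()` skips the text columns (`name`, `breed`) by default and only summarizes the numeric ones.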

# Exercises Part 1

## 1. Print the head of the data

In [1]:
import pandas as pd

In [2]:
homelessness = pd.read_csv('homelessness.csv')

In [3]:
homelessness.head()

,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


## 2. Print information about the data

In [4]:
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      51 non-null     int64  
 1   region          51 non-null     object 
 2   state           51 non-null     object 
 3   individuals     51 non-null     float64
 4   family_members  51 non-null     float64
 5   state_pop       51 non-null     int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 2.5+ KB


## 3. Print the shape of the data

In [5]:
homelessness.shape

(51, 6)

The homelessness data frame contains:
- 51 rows
- 6 columns

## 4. Print a description of the data frame

In [6]:
homelessness.describe()

,Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0,51.0
mean,25.0,7225.784314,3504.882353,6405637.0
std,14.866069,15991.025083,7805.411811,7327258.0
min,0.0,434.0,75.0,577601.0
25%,12.5,1446.5,592.0,1777414.0
50%,25.0,3082.0,1482.0,4461153.0
75%,37.5,6781.5,3196.0,7340946.0
max,50.0,109008.0,52070.0,39461590.0


## 5. Print the values of homelessness

In [7]:
homelessness.values

array([[0, 'East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       [1, 'Pacific', 'Alaska', 1434.0, 582.0, 735139],
       [2, 'Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       [3, 'West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       [4, 'Pacific', 'California', 109008.0, 20964.0, 39461588],
       [5, 'Mountain', 'Colorado', 7607.0, 3250.0, 5691287],
       [6, 'New England', 'Connecticut', 2280.0, 1696.0, 3571520],
       [7, 'South Atlantic', 'Delaware', 708.0, 374.0, 965479],
       [8, 'South Atlantic', 'District of Columbia', 3770.0, 3134.0,
        701547],
       [9, 'South Atlantic', 'Florida', 21443.0, 9587.0, 21244317],
       [10, 'South Atlantic', 'Georgia', 6943.0, 2556.0, 10511131],
       [11, 'Pacific', 'Hawaii', 4131.0, 2399.0, 1420593],
       [12, 'Mountain', 'Idaho', 1297.0, 715.0, 1750536],
       [13, 'East North Central', 'Illinois', 6752.0, 3891.0, 12723071],
       [14, 'East North Central', 'Indiana', 3776.0, 1482.0, 6695497],
    

## 6. Print the column index of homelessness

In [8]:
print(homelessness.columns)

Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
       'state_pop'],
      dtype='object')


## 7. Print the row index of homelessness

In [9]:
homelessness.index

RangeIndex(start=0, stop=51, step=1)

# Part 2: Sorting and subsetting

To sort the rows of a data frame by the values in a column, we use the following structure:
    
- `data_frame.sort_values("column_name")`
- `data_frame.sort_values("column_name", ascending = False)`

## Sorting by multiple variables

- `data_frame.sort_values(["var1", "var2"])`
- `data_frame.sort_values(["var1", "var2"], ascending=[True, False])`
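
The sorting patterns above can be sketched on a small made-up data frame (the `dogs` data here is hypothetical, chosen so that two rows tie on weight):

```python
import pandas as pd

# Hypothetical toy data; Charlie and Max share the same weight
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "weight_kg": [25, 23, 22, 23],
    "height_cm": [56, 43, 46, 49],
})

# Ascending by one column (the default)
by_weight = dogs.sort_values("weight_kg")

# Descending by one column
by_weight_desc = dogs.sort_values("weight_kg", ascending=False)

# Ascending weight, then descending height to break ties
by_weight_height = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)
print(by_weight_height)
```

`sort_values` returns a new data frame and leaves `dogs` unchanged, which is why the exercises below assign the result to a new variable.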

## Selecting columns

- `data_frame["column_name"]`
- `data_frame[["column_name1", "column_name2"]]` for multiple columns (note the double brackets: the inner pair is a list of column names)
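
A minimal sketch of column selection, assuming a made-up `dogs` data frame: a single column name gives back a Series, while a list of names gives back a DataFrame.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "height_cm": [56, 43, 46],
})

# One column name: the result is a pandas Series
names = dogs["name"]
print(type(names))

# A list of column names: the result is a DataFrame,
# with columns in the order given in the list
name_height = dogs[["height_cm", "name"]]
print(name_height)
```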

## Subsetting rows


- `data_frame["col_name"] > 50`
- `data_frame[data_frame["col_name"] > 50]`
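
The two lines above work together: the comparison produces a boolean Series, and passing that Series inside the brackets keeps only the rows where it is `True`. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
})

# The comparison alone returns one True/False value per row
is_tall = dogs["height_cm"] > 50
print(is_tall)

# Using the boolean Series as a filter keeps only the True rows
tall_dogs = dogs[is_tall]
print(tall_dogs)
```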

## Subsetting based on text data

- `data[data["col_name"] == "Name"]`

## Subsetting based on dates

- `dogs[dogs["date_of_birth"] > "2015-01-01"]`
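
A hedged sketch of date-based subsetting, using made-up birth dates. It assumes the column is first converted with `pd.to_datetime`, so that the comparison against the ISO-formatted string (`YYYY-MM-DD`) is a true date comparison.

```python
import pandas as pd

# Hypothetical toy data; dates start out as plain text
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25"],
})

# Convert the text column to a datetime column
dogs["date_of_birth"] = pd.to_datetime(dogs["date_of_birth"])

# Keep dogs born after 1 January 2015
born_after_2015 = dogs[dogs["date_of_birth"] > "2015-01-01"]
print(born_after_2015)
```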

## Subsetting based on multiple conditions

`is_lab = dogs["breed"] == "Labrador"`

`is_brown = dogs["color"] == 'Brown'`

`dogs[is_lab & is_brown]`

- or we can do all in one line:
    `dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == 'Brown')]`
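
A runnable sketch of combining conditions, on made-up data. It also shows `|` ("or") next to `&` ("and"). In the one-line form, each condition needs its own parentheses because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Black"],
})

is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"

# Both conditions must hold
lab_and_brown = dogs[is_lab & is_brown]

# At least one condition must hold
lab_or_brown = dogs[is_lab | is_brown]

# Same "and" filter written in one line
one_line = dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "Brown")]
print(lab_and_brown)
print(lab_or_brown)
```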

## Subsetting using `.isin()`

`is_black_or_brown = dogs["color"].isin(["Black", "Brown"])`

`dogs[is_black_or_brown]`
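
A small sketch of `.isin()` on made-up data. The negation with `~` is an addition for illustration: it keeps the rows whose value is *not* in the list.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "color": ["Brown", "Black", "Tan"],
})

# True where the color is one of the listed values
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
black_or_brown = dogs[is_black_or_brown]

# ~ flips the boolean Series, selecting every other row
other_colors = dogs[~is_black_or_brown]
print(black_or_brown)
print(other_colors)
```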

# Exercises Part 2

## 1. Sort homelessness by the number of homeless individuals, from smallest to largest, and save this as `homelessness_ind`. Print the head of the sorted DataFrame.

In [10]:
homelessness_ind = homelessness.sort_values("individuals")

In [11]:
homelessness_ind.head()

,Unnamed: 0,region,state,individuals,family_members,state_pop
50,50,Mountain,Wyoming,434.0,205.0,577601
34,34,West North Central,North Dakota,467.0,75.0,758080
7,7,South Atlantic,Delaware,708.0,374.0,965479
39,39,New England,Rhode Island,747.0,354.0,1058287
45,45,New England,Vermont,780.0,511.0,624358


## 2. Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam. Print the head of the sorted DataFrame.

In [12]:
homelessness_fam = homelessness.sort_values("family_members", ascending = False)

In [13]:
homelessness_fam.head()

,Unnamed: 0,region,state,individuals,family_members,state_pop
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,4,Pacific,California,109008.0,20964.0,39461588
21,21,New England,Massachusetts,6811.0,13257.0,6882635
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
43,43,West South Central,Texas,19199.0,6111.0,28628666


## 3. Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam. Print the head of the sorted DataFrame.

In [14]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending = [True, False])

# Print the top few rows
homelessness_reg_fam.head()

,Unnamed: 0,region,state,individuals,family_members,state_pop
13,13,East North Central,Illinois,6752.0,3891.0,12723071
35,35,East North Central,Ohio,6929.0,3320.0,11676341
22,22,East North Central,Michigan,5209.0,3142.0,9984072
49,49,East North Central,Wisconsin,2740.0,2167.0,5807406
14,14,East North Central,Indiana,3776.0,1482.0,6695497


## 4. Subsetting columns

### 4.1 Create a DataFrame called individuals that contains only the individuals column of homelessness.
Print the head of the result.

In [15]:
# Select the individuals column
individuals = homelessness["individuals"]

# Print the head of the result
print(individuals.head())

0      2570.0
1      1434.0
2      7259.0
3      2280.0
4    109008.0
Name: individuals, dtype: float64


### 4.2 Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
Print the head of the result.

In [16]:
# Select the state and family_members columns
state_fam = homelessness[["state", "family_members"]]

# Print the head of the result
print(state_fam.head())

        state  family_members
0     Alabama           864.0
1      Alaska           582.0
2     Arizona          2606.0
3    Arkansas           432.0
4  California         20964.0


### 4.3 Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.
Print the head of the result.

In [17]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]

# Print the head of the result
print(ind_state.head())

   individuals       state
0       2570.0     Alabama
1       1434.0      Alaska
2       7259.0     Arizona
3       2280.0    Arkansas
4     109008.0  California


### 4.4 Subsetting rows

- Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.

In [18]:
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]

In [19]:
ind_gt_10k

,Unnamed: 0,region,state,individuals,family_members,state_pop
4,4,Pacific,California,109008.0,20964.0,39461588
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,37,Pacific,Oregon,11139.0,3337.0,4181886
43,43,West South Central,Texas,19199.0,6111.0,28628666
47,47,Pacific,Washington,16424.0,5880.0,7523869


- Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.

In [20]:
mountain_reg = homelessness[homelessness["region"] == "Mountain"]

In [21]:
mountain_reg

,Unnamed: 0,region,state,individuals,family_members,state_pop
2,2,Mountain,Arizona,7259.0,2606.0,7158024
5,5,Mountain,Colorado,7607.0,3250.0,5691287
12,12,Mountain,Idaho,1297.0,715.0,1750536
26,26,Mountain,Montana,983.0,422.0,1060665
28,28,Mountain,Nevada,7058.0,486.0,3027341
31,31,Mountain,New Mexico,1949.0,602.0,2092741
44,44,Mountain,Utah,1904.0,972.0,3153550
50,50,Mountain,Wyoming,434.0,205.0,577601


- Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.

In [22]:
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

In [23]:
fam_lt_1k_pac

,Unnamed: 0,region,state,individuals,family_members,state_pop
1,1,Pacific,Alaska,1434.0,582.0,735139


### 4.5 Subsetting rows by categorical variables

In [24]:
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness["region"].isin(["South Atlantic", "Mid-Atlantic"])]

# See the result
print(south_mid_atlantic)

    Unnamed: 0          region                 state  individuals  \
7            7  South Atlantic              Delaware        708.0   
8            8  South Atlantic  District of Columbia       3770.0   
9            9  South Atlantic               Florida      21443.0   
10          10  South Atlantic               Georgia       6943.0   
20          20  South Atlantic              Maryland       4914.0   
30          30    Mid-Atlantic            New Jersey       6048.0   
32          32    Mid-Atlantic              New York      39827.0   
33          33  South Atlantic        North Carolina       6451.0   
38          38    Mid-Atlantic          Pennsylvania       8163.0   
40          40  South Atlantic        South Carolina       3082.0   
46          46  South Atlantic              Virginia       3928.0   
48          48  South Atlantic         West Virginia       1021.0   

    family_members  state_pop  
7            374.0     965479  
8           3134.0     701547  
9     

- Filter homelessness for cases where the state is in the list of Mojave Desert states, canu, assigning to mojave_homelessness. View the printed result.

In [25]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
print(mojave_homelessness)

    Unnamed: 0    region       state  individuals  family_members  state_pop
2            2  Mountain     Arizona       7259.0          2606.0    7158024
4            4   Pacific  California     109008.0         20964.0   39461588
28          28  Mountain      Nevada       7058.0           486.0    3027341
44          44  Mountain        Utah       1904.0           972.0    3153550


# Part 3: New columns

- **Example:**

`dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"]**2`

## Multiple manipulations

- Filtering for skinny dogs:

`bmi_lt_100 = dogs[dogs["bmi"] < 100]` 

- Sorting in descending order for height: 

`bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending = False)`

- Keeping only the columns of interest:

`bmi_lt_100_height[["name", "height_cm", "bmi"]]` 
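
The steps above can be put together in one runnable sketch. The data is made up, and the `height_m` column is derived from `height_cm` here as an assumption, since the example formula uses metres while the sorting step uses centimetres.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [25, 23, 22, 20],
})

# New columns are created by assigning to a new column name
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Filter, then sort, then keep only the columns of interest
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)
result = bmi_lt_100_height[["name", "height_cm", "bmi"]]
print(result)
```

Each step returns a new data frame, so the intermediate variables can be inspected one at a time if the final result looks wrong.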

# Exercises Part 3

## 1. Adding new columns

- Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.
- Add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals.

In [26]:
# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# Add p_individuals col as proportion of individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"] 

# See the result
print(homelessness)

    Unnamed: 0              region                 state  individuals  \
0            0  East South Central               Alabama       2570.0   
1            1             Pacific                Alaska       1434.0   
2            2            Mountain               Arizona       7259.0   
3            3  West South Central              Arkansas       2280.0   
4            4             Pacific            California     109008.0   
5            5            Mountain              Colorado       7607.0   
6            6         New England           Connecticut       2280.0   
7            7      South Atlantic              Delaware        708.0   
8            8      South Atlantic  District of Columbia       3770.0   
9            9      South Atlantic               Florida      21443.0   
10          10      South Atlantic               Georgia       6943.0   
11          11             Pacific                Hawaii       4131.0   
12          12            Mountain                 

## 2. Combo-attack!

- Add a column to homelessness, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state.
- Subset rows where indiv_per_10k is higher than 20, assigning to high_homelessness.
- Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.
- Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result. Look at the result.

In [28]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending = False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(result)

                   state  indiv_per_10k
8   District of Columbia      53.738381
11                Hawaii      29.079406
4             California      27.623825
37                Oregon      26.636307
28                Nevada      23.314189
47            Washington      21.829195
32              New York      20.392363
