### Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. 
There are several useful methods and attributes for this.

- `.head()` returns the first few rows (the “head” of the DataFrame).

- `.info()` shows information on each of the columns, such as the data type and the number of non-null values (from which you can tell how many values are missing).

- `.shape` returns the number of rows and columns of the DataFrame.

- `.describe()` calculates a few summary statistics for each numeric column.
        
`homelessness` is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The `individuals` column is the number of homeless individuals who are not part of a family with children. The `family_members` column is the number of homeless individuals who are part of a family with children. The `state_pop` column is the state's total population.
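As a minimal sketch of the four inspection tools listed above, the block below rebuilds the first five rows of the dataset in memory, so it runs without `homelessness.csv`. The name `sample` is an assumption made for illustration, and the statistics it prints describe only these five rows, not the full dataset.

```python
import pandas as pd

# First five rows of the homelessness data, copied from the output in this section.
# `sample` is a hypothetical name used only for this sketch.
sample = pd.DataFrame(
    {
        "region": ["East South Central", "Pacific", "Mountain",
                   "West South Central", "Pacific"],
        "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
        "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
        "state_pop": [4887681, 735139, 7158024, 3009733, 39461588],
    }
)

first_rows = sample.head(3)     # .head(n) returns the first n rows (n defaults to 5)
n_rows, n_cols = sample.shape   # .shape is an attribute, so it takes no parentheses
summary = sample.describe()     # summary statistics for the numeric columns only

print(first_rows)
print(n_rows, n_cols)
print(summary.loc["mean", "individuals"])  # read one statistic out of the summary
sample.info()                   # prints dtypes and non-null counts directly
```

Note that `.describe()` returns a DataFrame, so individual statistics can be pulled out with `.loc`, while `.info()` prints its report rather than returning it.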



In [2]:
import pandas as pd

# Load the data, using the first CSV column as the row index
homelessness = pd.read_csv("homelessness.csv", index_col=0)
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


In [5]:

# Print information about homelessness
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB


In [6]:

# Print the shape of homelessness
print(homelessness.shape)

(51, 5)


In [7]:
# Print a description of homelessness
homelessness.describe()

Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0
mean,7225.784314,3504.882353,6405637.0
std,15991.025083,7805.411811,7327258.0
min,434.0,75.0,577601.0
25%,1446.5,592.0,1777414.0
50%,3082.0,1482.0,4461153.0
75%,6781.5,3196.0,7340946.0
max,109008.0,52070.0,39461590.0


Insightful inspecting! The `mean` row of the `.describe()` output shows that the average number of homeless individuals per state is about 7,226. Let's explore the DataFrame further.

### Sorting rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to `.sort_values()`.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

| Sort on … | Syntax |
| --- | --- |
| one column | `df.sort_values("breed")` |
| multiple columns | `df.sort_values(["breed", "weight_kg"])` |

By combining `.sort_values()` with `.head()`, you can answer questions of the form, "What are the top cases where…?".
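The tie-breaking and "top cases" patterns described above can be sketched as follows. The rows are copied from the homelessness output in this section; the variable names `df`, `by_region`, and `top_two` are assumptions made for illustration.

```python
import pandas as pd

# A few rows copied from the homelessness output in this section
df = pd.DataFrame(
    {
        "region": ["East South Central", "Pacific", "Mountain",
                   "West South Central", "Pacific"],
        "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    }
)

# Sort by region A-Z; within a region, break ties by putting the
# largest individuals count first (one ascending flag per column)
by_region = df.sort_values(["region", "individuals"], ascending=[True, False])
print(by_region)

# "Top cases" pattern: sort in descending order, then keep the first rows
top_two = df.sort_values("individuals", ascending=False).head(2)
print(top_two)
```

Passing a list to `ascending` lets each sort column have its own direction, which is useful when a categorical column should read alphabetically but a numeric column should show its largest values first.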



In [10]:
# Sort homelessness by individuals, largest first, and show the top five rows
homelessness.sort_values(by='individuals', ascending=False).head()


Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008.0,20964.0,39461588
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
9,South Atlantic,Florida,21443.0,9587.0,21244317
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869


The states with the most homeless individuals are `California`, `New York`, `Florida`, `Texas`, and `Washington`.

In [11]:
# Same sort, but show the last five rows (the smallest counts)
homelessness.sort_values(by='individuals', ascending=False).tail()


Unnamed: 0,region,state,individuals,family_members,state_pop
45,New England,Vermont,780.0,511.0,624358
39,New England,Rhode Island,747.0,354.0,1058287
7,South Atlantic,Delaware,708.0,374.0,965479
34,West North Central,North Dakota,467.0,75.0,758080
50,Mountain,Wyoming,434.0,205.0,577601


In contrast, `Vermont`, `Rhode Island`, `Delaware`, `North Dakota`, and `Wyoming` have the fewest homeless individuals.
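As a possible shortcut for the sort-then-`.head()` and sort-then-`.tail()` pattern used above, pandas also provides `.nlargest()` and `.nsmallest()`. This is a sketch on a handful of rows copied from the two sorted outputs, not the approach the notebook itself uses; `df`, `top`, and `bottom` are illustrative names.

```python
import pandas as pd

# Rows copied from the sorted outputs in this section
df = pd.DataFrame(
    {
        "state": ["California", "New York", "Florida",
                  "Vermont", "North Dakota", "Wyoming"],
        "individuals": [109008.0, 39827.0, 21443.0, 780.0, 467.0, 434.0],
    }
)

# The two rows with the largest values, largest first
top = df.nlargest(2, "individuals")

# The two rows with the smallest values, smallest first
bottom = df.nsmallest(2, "individuals")

print(top)
print(bottom)
```

Unlike `.tail()` on a descending sort, `.nsmallest()` returns its rows with the smallest value first.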

### Subsetting rows

A large part of data science is about finding which bits of your dataset are interesting. 
One of the simplest techniques for this is to find a subset of rows that match some criteria.
This is sometimes known as `filtering rows` or `selecting rows`.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return True or False for each row, then pass that result inside square brackets.

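- `dogs[dogs["height_cm"] > 60]`
- `dogs[dogs["color"] == "tan"]`

You can filter for multiple conditions at once by using the "bitwise and" operator, `&`. Wrap each condition in parentheses, because `&` binds more tightly than comparison operators such as `>` and `==`.

- `dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]`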
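To make the intermediate True/False values visible, the sketch below stores each condition as a named boolean Series before combining them. It uses rows copied from the homelessness outputs in this section rather than the `dogs` data; the names `is_large`, `is_pacific`, and the result variables are assumptions for illustration.

```python
import pandas as pd

# Rows copied from the homelessness outputs in this section
df = pd.DataFrame(
    {
        "region": ["Pacific", "South Atlantic", "Mid-Atlantic",
                   "Pacific", "Pacific", "Mountain"],
        "state": ["California", "Florida", "New York",
                  "Oregon", "Alaska", "Arizona"],
        "individuals": [109008.0, 21443.0, 39827.0, 11139.0, 1434.0, 7259.0],
    }
)

# Each comparison produces a boolean Series with one value per row
is_large = df["individuals"] > 10_000
is_pacific = df["region"] == "Pacific"

# & keeps rows where both conditions hold; | keeps rows where either holds
large_pacific = df[is_large & is_pacific]
either = df[is_large | is_pacific]

# The same "and" filter written inline; the parentheses are required
inline = df[(df["individuals"] > 10_000) & (df["region"] == "Pacific")]

print(large_pacific)
print(either)
```

Naming the masks first tends to be easier to read and debug than one long inline expression, and both forms select the same rows.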


In [13]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10_000]
# See the result
ind_gt_10k

Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008.0,20964.0,39461588
9,South Atlantic,Florida,21443.0,9587.0,21244317
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,Pacific,Oregon,11139.0,3337.0,4181886
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869


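### Subsetting rows by categorical variables

Subsetting on a categorical variable often means matching several values at once. Instead of chaining `==` comparisons with `|`, you can pass the allowed values to `.isin()`. Filter `homelessness` for cases where the state is in the list of Mojave Desert states, `canu`, assigning the result to `mojave_homelessness`.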


In [14]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result
mojave_homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
4,Pacific,California,109008.0,20964.0,39461588
28,Mountain,Nevada,7058.0,486.0,3027341
44,Mountain,Utah,1904.0,972.0,3153550
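The `.isin()` filter above should select the same rows as a chain of `==` comparisons joined with `|`, and it can be inverted with `~`. The sketch below checks both on a few rows copied from this section's outputs; `df`, `in_mojave`, `chained`, and `not_mojave` are illustrative names.

```python
import pandas as pd

# Rows copied from the homelessness outputs in this section
df = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona", "California", "Nevada", "Utah"],
        "individuals": [2570.0, 1434.0, 7259.0, 109008.0, 7058.0, 1904.0],
    }
)

# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# .isin() returns True for each row whose state appears in the list
in_mojave = df["state"].isin(canu)
mojave = df[in_mojave]

# The longer equivalent: one == comparison per state, joined with |
chained = df[
    (df["state"] == "California")
    | (df["state"] == "Arizona")
    | (df["state"] == "Nevada")
    | (df["state"] == "Utah")
]

# ~ flips the mask, selecting every state that is NOT in the list
not_mojave = df[~in_mojave]

print(mojave)
print(not_mojave)
```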


### Adding new columns
You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as `transforming`, `mutating`, and `feature engineering`.

In [None]:
# Add a total column as the sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add a p_homeless column as the proportion of the state population that is homeless
homelessness['p_homeless'] = homelessness['total'] / homelessness['state_pop']

# See the result
homelessness
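The cell above adds the columns to `homelessness` in place. One alternative, sketched here on two rows copied from this section's output, is `.assign()`, which returns a new DataFrame and leaves the original untouched. The names `df` and `with_rates` are assumptions for illustration.

```python
import pandas as pd

# Two rows copied from the homelessness output in this section
df = pd.DataFrame(
    {
        "state": ["Alabama", "California"],
        "individuals": [2570.0, 109008.0],
        "family_members": [864.0, 20964.0],
        "state_pop": [4887681, 39461588],
    }
)

# .assign() returns a copy with the extra columns. Each lambda receives the
# DataFrame built so far, so p_homeless can use the new total column.
with_rates = df.assign(
    total=lambda d: d["individuals"] + d["family_members"],
    p_homeless=lambda d: d["total"] / d["state_pop"],
)

print(with_rates)
print(list(df.columns))  # the original DataFrame still has only four columns
```

Keeping the original DataFrame unchanged can make notebook cells safer to re-run, since a cell no longer depends on whether an earlier cell has already added its columns.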


In [27]:
# For each region, find the highest and lowest p_homeless, ordered by the highest
homelessness.groupby('region')['p_homeless'].agg(['max', 'min']).sort_values(by='max', ascending=False)

region,max,min
South Atlantic,0.009841,0.000689
Mid-Atlantic,0.004705,0.001056
Pacific,0.004597,0.002742
New England,0.002916,0.00104
Mountain,0.002492,0.000912
West North Central,0.001319,0.000715
East South Central,0.001164,0.000454
West South Central,0.000982,0.000656
East North Central,0.000878,0.000785


### Combo-attack!

You've seen the four most common types of data manipulation: `sorting rows`, `subsetting columns`, `subsetting rows`, and `adding new columns`. 

In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, `"Which state has the highest number of homeless individuals per 10,000 people in the state?"` Combine your new pandas skills to find out.

1. Add a column to `homelessness`, `indiv_per_10k`, containing the number of homeless individuals per ten thousand people in each state, using `state_pop` for state population.

In [29]:
# Individuals per 10,000 residents, then its max, mean, and min across states
homelessness['indiv_per_10k'] = 10_000 * homelessness['individuals'] / homelessness['state_pop']
homelessness['indiv_per_10k'].agg(['max', 'mean', 'min'])

max     53.738381
mean    10.430003
min      3.435066
Name: indiv_per_10k, dtype: float64

2. Subset rows where `indiv_per_10k` is higher than `20`, assigning to `high_homelessness`.

In [32]:
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.
high_homelessness_srt = high_homelessness.sort_values(by='indiv_per_10k', ascending=False)

# Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result. Look at the result.
result = high_homelessness_srt[['state','indiv_per_10k']]
result

Unnamed: 0,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
47,Washington,21.829195
32,New York,20.392363


Cool combination! `District of Columbia` has the highest rate of homeless individuals, almost `54 per ten thousand people`. This is almost double the rate of the next-highest state, Hawaii. By combining column addition, row subsetting, sorting, and column selection, you can answer lots of questions like this.
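The four steps above can also be written as a single method chain. This is a sketch under stated assumptions: it uses only the states whose raw counts appear in this section, so District of Columbia and Hawaii are absent from its result, and the names `df` and `result` are illustrative.

```python
import pandas as pd

# Rows copied from the homelessness outputs in this section
df = pd.DataFrame(
    {
        "state": ["California", "Florida", "Nevada", "New York",
                  "Oregon", "Texas", "Washington"],
        "individuals": [109008.0, 21443.0, 7058.0, 39827.0,
                        11139.0, 19199.0, 16424.0],
        "state_pop": [39461588, 21244317, 3027341, 19530351,
                      4181886, 28628666, 7523869],
    }
)

result = (
    df
    # 1. Add the new column
    .assign(indiv_per_10k=lambda d: 10_000 * d["individuals"] / d["state_pop"])
    # 2. Subset rows where the rate is above 20
    .loc[lambda d: d["indiv_per_10k"] > 20]
    # 3. Sort with the highest rate first
    .sort_values("indiv_per_10k", ascending=False)
    # 4. Keep only the two columns of interest
    .loc[:, ["state", "indiv_per_10k"]]
)

print(result)
```

Chaining avoids the intermediate variables `high_homelessness` and `high_homelessness_srt`, at the cost of making each individual step a little harder to inspect on its own.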