# Demographic Data Analysis
## Milestone 1: Help Level 3

## 1. Data Collection

In [1]:
# import the Pandas library
import pandas as pd

You can load the CSV files with the pandas function [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).   
Check the following parameters: `sep`, `names`, `header`, and `usecols`.
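
As a minimal sketch of how these four parameters interact, the block below reads a small in-memory sample instead of the real file. The sample rows, the variable names `sample` and `demo`, and the choice of `;` as separator are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a semicolon-separated file without a header row
sample = io.StringIO(
    "Montana;93.0%;1;30.7%;21\n"
    "Alaska;92.4%;5;29.0%;28\n"
)

demo = pd.read_csv(
    sample,
    sep=";",       # character that separates the columns
    header=None,   # the sample has no header row
    names=["State", "HSGradPer", "HSRank", "BADegPer", "BARank"],  # names to assign
    usecols=["State", "HSGradPer", "BADegPer"],  # load only these columns
)
print(demo)
```

With `usecols` the rank columns are never loaded, so there is nothing to drop afterwards.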

### Education file: "education.csv"
The file contains the following columns:  

```
Column   | Description
---------| --------------------------------
1°       | name of the state
2°       | % high school graduate or higher
3°       | high school rank
4°       | % bachelor's degree or higher
5°       | bachelor's degree rank
6°       | % advanced degree or higher
7°       | advanced degree rank
```



In [4]:
# There is no mention of column names, so we should first read the file to see what it contains.
# Load and inspect the file education.csv
education_csv_name = 'work/csv/education.csv'
# df = pd.read_csv('%s' % education_csv_name)

In [5]:
# Load and inspect the file education.csv 
# You can use the following columns names: 'State','HSGradPer','HSRank','BADegPer','BARank','AdvDegPer','AdvRank'
# Ignore the rank columns or delete them after loading

# Read the CSV file with custom column names
# Skip the initial rows that are not part of the data
df = pd.read_csv(education_csv_name, skiprows=2,
                 delimiter=';',
                 header=None,
                 names=['State','HSGradPer','HSRank','BADegPer','BARank','AdvDegPer','AdvRank'])

print(df.head())

# Drop the columns corresponding to ranks
df.drop(['HSRank', 'BARank', 'AdvRank'], axis=1, inplace=True)

# Show the first few rows of the DataFrame to inspect the data
print(df)

            State HSGradPer  HSRank BADegPer  BARank AdvDegPer  AdvRank
0         Montana     93.0%     1.0    30.7%    21.0     10.1%     33.0
1   New Hampshire     92.8%     2.0    36.0%     9.0     13.8%     10.0
2       Minnesota     92.8%     3.0    34.8%    11.0     11.8%     18.0
3         Wyoming     92.8%     4.0    26.7%    41.0      9.3%     39.0
4          Alaska     92.4%     5.0    29.0%    28.0     10.4%     29.0
                    State HSGradPer BADegPer AdvDegPer
0                 Montana     93.0%    30.7%     10.1%
1           New Hampshire     92.8%    36.0%     13.8%
2               Minnesota     92.8%    34.8%     11.8%
3                 Wyoming     92.8%    26.7%      9.3%
4                  Alaska     92.4%    29.0%     10.4%
5            North Dakota     92.3%    28.9%      7.8%
6                 Vermont     92.3%    36.8%     15.0%
7                   Maine     92.1%    30.3%     10.9%
8                    Iowa     91.8%    27.7%      9.0%
9                 

In [19]:
# We want to use the column *State* as index. Let's run some checks first.
# Check that there are no extraneous values in the column State. If you find some, clean them.
print(df.index)
print(df.columns)
print(df.dtypes)

# Check for duplicate values
if df['State'].duplicated().any():
    print("Duplicate values found in the State column:")
    print(df[df['State'].duplicated(keep=False)].sort_values(by='State'))
else:
    print("No duplicate values found in the State column.")

# Check for Missing or NaN Values
if df['State'].isna().any():
    print("NaN values found in the State column:")
    print(df[df['State'].isna()])
else:
    print("No NaN values found in the State column.")

non_strings = df[df['State'].apply(lambda x: not isinstance(x, str))]
if not non_strings.empty:
    print("Non-string values found in the State column:")
    print(non_strings)
else:
    print("All values in the State column are strings.")

# Set the index to 'State'
df.set_index('State', inplace=True)

print(df.index)
print(df.columns)
print(df.dtypes)

RangeIndex(start=0, stop=52, step=1)
Index(['State', 'HSGradPer', 'BADegPer', 'AdvDegPer'], dtype='object')
State        object
HSGradPer    object
BADegPer     object
AdvDegPer    object
dtype: object
No duplicate values found in the State column.
No NaN values found in the State column.
All values in the State column are strings.
Index([' Montana', ' New Hampshire', ' Minnesota', ' Wyoming', ' Alaska',
       ' North Dakota', ' Vermont', ' Maine', ' Iowa', ' Utah', ' Wisconsin',
       ' Hawaii', ' South Dakota', ' Colorado', ' Nebraska', ' Washington',
       ' Kansas', ' District of Columbia', ' Massachusetts', ' Idaho',
       ' Michigan', ' Connecticut', ' Oregon', ' Pennsylvania', ' Maryland',
       ' Ohio', ' Delaware', ' Missouri', ' New Jersey', ' Virginia',
       ' Illinois', ' Indiana', ' Florida', ' Oklahoma', ' Rhode Island',
       ' United States', ' North Carolina', ' South Carolina', ' Tennessee',
       ' Georgia', ' New York', ' West Virginia', ' Nevada', ' Arkans

You can use a list comprehension to return all states in the column State that contain extraneous characters. The Python string method [isalpha](https://docs.python.org/3/library/stdtypes.html) checks whether all characters in a string are alphabetic. 
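
The hint above could be sketched as follows. The sample values and the names `states`, `flagged`, and `flagged_ignoring_spaces` are hypothetical; the leading space mimics the raw values visible in the index output above.

```python
import pandas as pd

# Hypothetical sample; the leading space mimics the raw values in the file
states = pd.Series([" Montana", "New Hampshire", "Iowa", "Utah*"])

# isalpha() is False as soon as one character is not a letter, spaces included
flagged = [s for s in states if not s.isalpha()]
print(flagged)

# Variant that tolerates whitespace, so only other stray characters are reported
flagged_ignoring_spaces = [
    s for s in states if not s.strip().replace(" ", "").isalpha()
]
print(flagged_ignoring_spaces)
```

Note that the plain check also reports legitimate multi-word names such as "New Hampshire", because a space is not an alphabetic character.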

In [20]:
# In the column State replace the whitespaces with underscores

# Strip leading and trailing whitespaces from the index
df.index = df.index.str.strip()

# Replace all remaining whitespace with underscores
df.index = df.index.str.replace(' ', '_')

# Show the modified index
print(df.index)

# Check if index values are unique
are_indexes_unique = df.index.is_unique

# Print the result
print("Are index values unique?", are_indexes_unique)


Index(['Montana', 'New_Hampshire', 'Minnesota', 'Wyoming', 'Alaska',
       'North_Dakota', 'Vermont', 'Maine', 'Iowa', 'Utah', 'Wisconsin',
       'Hawaii', 'South_Dakota', 'Colorado', 'Nebraska', 'Washington',
       'Kansas', 'District_of_Columbia', 'Massachusetts', 'Idaho', 'Michigan',
       'Connecticut', 'Oregon', 'Pennsylvania', 'Maryland', 'Ohio', 'Delaware',
       'Missouri', 'New_Jersey', 'Virginia', 'Illinois', 'Indiana', 'Florida',
       'Oklahoma', 'Rhode_Island', 'United_States', 'North_Carolina',
       'South_Carolina', 'Tennessee', 'Georgia', 'New_York', 'West_Virginia',
       'Nevada', 'Arkansas', 'Alabama', 'Kentucky', 'New_Mexico', 'Louisiana',
       'Mississippi', 'Texas', 'California', 'Arizona'],
      dtype='object', name='State')
Are index values unique? True


You can use the Series method [str.replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html?highlight=str%20replace#pandas.Series.str.replace)

In [21]:
# Get rid of the summary row "United_States"

# Remove the row with index 'United_States'
# df.drop('United_States', inplace=True)
df.drop('United_States', inplace=True, errors='ignore')

print(df)


                     HSGradPer BADegPer AdvDegPer
State                                            
Montana                  93.0%    30.7%     10.1%
New_Hampshire            92.8%    36.0%     13.8%
Minnesota                92.8%    34.8%     11.8%
Wyoming                  92.8%    26.7%      9.3%
Alaska                   92.4%    29.0%     10.4%
North_Dakota             92.3%    28.9%      7.8%
Vermont                  92.3%    36.8%     15.0%
Maine                    92.1%    30.3%     10.9%
Iowa                     91.8%    27.7%      9.0%
Utah                     91.8%    32.5%     11.0%
Wisconsin                91.7%    29.0%      9.9%
Hawaii                   91.6%    32.0%     10.8%
South_Dakota             91.4%    27.8%      8.3%
Colorado                 91.1%    39.4%     14.6%
Nebraska                 90.9%    30.6%     10.2%
Washington               90.8%    34.5%     12.7%
Kansas                   90.5%    32.3%     11.7%
District_of_Columbia     90.3%    56.6%     32.8%


You can use the DataFrame method [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

In [22]:
# Make a last check that there is only one row for each state
# Count the number of unique index values (states)
num_unique_states = df.index.nunique()

# Count the total number of rows in the DataFrame
total_rows = df.shape[0]

# Check if each state has only one corresponding row
if num_unique_states == total_rows:
    print("There is only one row for each state.")
else:
    print("Some states appear more than once.")

There is only one row for each state.


In [23]:
# Set the column State as the index
# Order the dataframe according to the index
# Inspect the dataframe

# Set the 'State' column as the index
# df.set_index('State', inplace=True)

# Sort the DataFrame by the index (State)
df.sort_index(inplace=True)

# Show the first few rows of the DataFrame to inspect the data
print(df)

                     HSGradPer BADegPer AdvDegPer
State                                            
Alabama                  85.3%    24.5%      9.1%
Alaska                   92.4%    29.0%     10.4%
Arizona                  82.1%    28.4%     10.7%
Arkansas                 85.6%    22.0%      7.9%
California               82.5%    32.6%     12.2%
Colorado                 91.1%    39.4%     14.6%
Connecticut              90.2%    38.4%     17.0%
Delaware                 89.3%    31.0%     12.9%
District_of_Columbia     90.3%    56.6%     32.8%
Florida                  87.6%    28.5%     10.3%
Georgia                  86.3%    29.9%     11.4%
Hawaii                   91.6%    32.0%     10.8%
Idaho                    90.2%    26.8%      8.5%
Illinois                 88.6%    33.4%     13.0%
Indiana                  88.3%    25.3%      9.2%
Iowa                     91.8%    27.7%      9.0%
Kansas                   90.5%    32.3%     11.7%
Kentucky                 85.2%    23.2%      9.6%


In [27]:
# Check that there is only one row for each state and then set *State* as the index of the table.

In [24]:
# Check that all the numerical columns were loaded as numbers
# If that's not the case, find out why, correct it, and cast the columns as numbers
print(df.dtypes)

print(df['HSGradPer'].unique())
print(df['BADegPer'].unique())
print(df['AdvDegPer'].unique())

# The trailing '%' makes pandas load these columns as strings.
# Strip it, then cast each column to a number (unparseable values become NaN).
for col in ['HSGradPer', 'BADegPer', 'AdvDegPer']:
    df[col] = df[col].str.replace('%', '', regex=False)
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Count the NaN values per column after the conversion
print(df.isna().sum())

print(df.dtypes)

HSGradPer    object
BADegPer     object
AdvDegPer    object
dtype: object
['85.3%' '92.4%' '82.1%' '85.6%' '82.5%' '91.1%' '90.2%' '89.3%' '90.3%'
 '87.6%' '86.3%' '91.6%' '88.6%' '88.3%' '91.8%' '90.5%' '85.2%' '84.3%'
 '92.1%' '89.8%' '92.8%' '83.4%' '89.2%' '93.0%' '90.9%' '85.8%' '85.0%'
 '86.1%' '86.9%' '92.3%' '87.5%' '76.7%' '89.9%' '87.3%' '86.5%' '91.4%'
 '82.8%' '89.0%' '90.8%' '85.9%' '91.7%']
['24.5%' '29.0%' '28.4%' '22.0%' '32.6%' '39.4%' '38.4%' '31.0%' '56.6%'
 '28.5%' '29.9%' '32.0%' '26.8%' '33.4%' '25.3%' '27.7%' '32.3%' '23.2%'
 '23.4%' '30.3%' '39.0%' '42.1%' '28.1%' '34.8%' '21.3%' '28.2%' '30.7%'
 '30.6%' '23.7%' '36.0%' '38.1%' '26.9%' '35.3%' '28.9%' '27.2%' '24.8%'
 '30.1%' '33.0%' '27.0%' '27.8%' '26.1%' '28.7%' '32.5%' '36.8%' '37.6%'
 '34.5%' '19.9%' '26.7%']
['9.1%' '10.4%' '10.7%' '7.9%' '12.2%' '14.6%' '17.0%' '12.9%' '32.8%'
 '10.3%' '11.4%' '10.8%' '8.5%' '13.0%' '9.2%' '9.0%' '11.7%' '9.6%'
 '8.1%' '10.9%' '18.0%' '18.7%' '11.0%' '11.8%' '8.0%' '10.1%

In [25]:
# Make a last inspection of the education dataframe
print(df)

                      HSGradPer  BADegPer  AdvDegPer
State                                               
Alabama                    85.3      24.5        9.1
Alaska                     92.4      29.0       10.4
Arizona                    82.1      28.4       10.7
Arkansas                   85.6      22.0        7.9
California                 82.5      32.6       12.2
Colorado                   91.1      39.4       14.6
Connecticut                90.2      38.4       17.0
Delaware                   89.3      31.0       12.9
District_of_Columbia       90.3      56.6       32.8
Florida                    87.6      28.5       10.3
Georgia                    86.3      29.9       11.4
Hawaii                     91.6      32.0       10.8
Idaho                      90.2      26.8        8.5
Illinois                   88.6      33.4       13.0
Indiana                    88.3      25.3        9.2
Iowa                       91.8      27.7        9.0
Kansas                     90.5      32.3     

### File Life Expectancy: "life_expectancy.csv"
The file contains the following columns:  

```
Column        | Description
--------------| --------------------------------
State         | name of the state  
LifeExp2018   | life expectancy (2017)   
LifeExp2010   | life expectancy (2010)
MaleLifeExp   | male life expectancy
FemLifeExp    | female life expectancy
```


In [35]:
# load the file life_expectancy.csv and make a first inspection

life_exp_csv_name = 'work/csv/life_expectancy.csv'
life_exp = pd.read_csv(life_exp_csv_name, delimiter=';')

print(life_exp)
print(life_exp.columns)
# Strip extraneous whitespace from the state names, if the column is present
if 'State' in life_exp.columns:
    life_exp['State'] = life_exp['State'].str.strip()
else:
    print("Column 'State' not found in DataFrame.")


                        State LifeExp2018  LifeExp2010  MaleLifeExp  \
0                      Hawaii        82.3         81.4         79.3   
1                  California        81.6         80.6         79.4   
2                 Puerto Rico        81.3         78.7         77.6   
3                    New York        81.3         80.3         79.0   
4         U.S. Virgin Islands        81.2         79.2         76.3   
5                   Minnesota        81.0         80.8         79.0   
6                 Connecticut        80.9         80.7         78.7   
7                        Guam        80.7         78.2         77.6   
8                    Colorado        80.5         80.1         78.5   
9               Massachusetts        80.5         80.5         78.2   
10                 Washington        80.4         80.1         78.4   
11                 New Jersey        80.4         80.0         78.2   
12                    Florida        80.0         79.0         77.3   
13    

In [7]:
# We'll follow the same steps for the file life_expectancy
# Since you will have to repeat these steps for the other files, we are going to define a function to clean a dataset
def set_state_as_index(df):
    # Clean the column 'State', eliminating extraneous whitespaces at both ends
    df['State'] = df['State'].str.strip()

    # Replace the middle whitespaces with underscores
    df['State'] = df['State'].str.replace(' ', '_')

    # Check that there are no duplicates in the column 'State'
    if df['State'].duplicated().any():
        print("Duplicate values found in the State column:")
        print(df[df['State'].duplicated(keep=False)].sort_values(by='State'))
        # Additional steps to handle duplicates would go here
    else:
        print("No duplicate values found in the State column.")
    
    # Set the 'State' column as the index of the DataFrame and sort by the index
    df.set_index('State', inplace=True)
    df.sort_index(inplace=True)
    
    # If there is a summary row "United_States", drop it
    if "United_States" in df.index:
        df.drop("United_States", inplace=True)

    return df

In [36]:
# Run the function set_state_as_index on life_exp
life_exp = set_state_as_index(life_exp)

No duplicate values found in the State column.


In [37]:
# inspect the dataframe
print(life_exp)

                         LifeExp2018  LifeExp2010  MaleLifeExp  FemLifeExp
State                                                                     
Alabama                         75.4         75.4         72.6        78.1
Alaska                          78.8         78.0         76.7        81.2
American_Samoa                  74.8         74.0         73.0        77.0
Arizona                         79.9         79.3         77.5        82.3
Arkansas                        75.9         76.0         73.1        78.6
California                      81.6         80.6         79.4        83.8
Colorado                        80.5         80.1         78.5        82.5
Connecticut                     80.9         80.7         78.7        83.0
Delaware                        78.4         78.3         76.2        80.6
District_of_Columbia            78.6         76.5         75.7        81.3
Florida                         80.0         79.0         77.3        82.6
Georgia                  

Now let's check which numeric columns contain anything other than digits or a dot.   
We are going to use regular expressions to check whether the elements of the dataframe are numbers (digits with or without a dot).  
If you want to review regular expressions, see the link we provided in the project.
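
To get a feel for such a check before applying it to a whole dataframe, here is a small sketch. The pattern, the helper name `looks_numeric`, and the sample strings are assumptions for illustration; this pattern is deliberately strict and accepts only digits with at most one decimal part.

```python
import re

# Assumed strict pattern: digits, optionally followed by one dot and more digits
NUMBER_PATTERN = re.compile(r"\d+(\.\d+)?")

def looks_numeric(value):
    """Return True if str(value) is made only of digits with an optional decimal part."""
    return NUMBER_PATTERN.fullmatch(str(value)) is not None

# Hypothetical sample values, including a footnote marker and a thousands separator
samples = ["81.3", "79", "78.5[a]", "4,853,875", ""]
results = {s: looks_numeric(s) for s in samples}
print(results)
```

Using `fullmatch` removes the need for the `^` and `$` anchors, since the whole string has to match the pattern.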

In [34]:
# Define a function that uses regular expressions to check whether a value looks like a number
import re
def check_digit_or_dot(x):
    if pd.isna(x):  # Treat NaN values as non-numeric
        return False
    # Matches digits optionally followed by groups of a dot or comma and more digits,
    # e.g. 81.3 or 4,853,875
    pattern = r'^\d+([.,]\d+)*$'
    return bool(re.match(pattern, str(x)))


In [None]:
# create a boolean dataframe that is the result of applying the function check_digit_or_dot to all the elements of life_exp
bool_life_exp = life_exp.applymap(check_digit_or_dot)

# inspect the rows of the boolean dataframe to see if there are any False values
print("Boolean DataFrame:")
print(bool_life_exp)

# Print the rows of df where there is at least one value that's not a number
print("\nRows containing at least one value that's not a number:")
print(life_exp[~bool_life_exp.all(axis=1)])


In [51]:
# Cast the columns to numeric; errors='raise' stops at the first value that cannot be parsed
for col in life_exp.columns:
    life_exp[col] = pd.to_numeric(life_exp[col], errors='raise')

# check if all the columns are numeric
print(life_exp.dtypes)

LifeExp2018    float64
LifeExp2010    float64
MaleLifeExp    float64
FemLifeExp     float64
dtype: object


In [52]:
# inspect the whole dataframe life_exp to see if all is ok
print(life_exp)

                          LifeExp2018  LifeExp2010  MaleLifeExp  FemLifeExp
State                                                                      
Alabama                          75.4         75.4         72.6        78.1
Alaska                           78.8         78.0         76.7        81.2
American_Samoa                   74.8         74.0         73.0        77.0
Arizona                          79.9         79.3         77.5        82.3
Arkansas                         75.9         76.0         73.1        78.6
California                       81.6         80.6         79.4        83.8
Colorado                         80.5         80.1         78.5        82.5
Connecticut                      80.9         80.7         78.7        83.0
Delaware                         78.4         78.3         76.2        80.6
District_of_Columbia             78.6         76.5         75.7        81.3
Florida                          80.0         79.0         77.3        82.6
Georgia     

### Crime file: "crime.csv"

The file contains the following columns:

```
Column   |  Description
---------| --------------------------------
1°       |  name of the state                              
2°       |  population (total inhabitants) (2015)                   
3°       |  murders and non-negligent manslaughter (total deaths) (2015)
4°       |  murders (total deaths) (2015)
5°       |  gun murders (total deaths) (2015)
6°       |  gun ownership (%) (2013)
7°       |  murders and non-negligent manslaughter rate (per 100,000) (2015)
8°       |  murder rate (per 100,000) 
9°       |  gun murder rate (per 100,000)
```


Follow the same procedure as with the other files
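
One detail specific to this file is that the counts use a comma as thousands separator. As a hedged sketch of one way to handle that at load time, `read_csv` accepts a `thousands` parameter. The sample below is hypothetical: the column names are simplified and the `Rate` values are made up.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file has more columns and multi-line headers
sample = io.StringIO(
    "State;Population;Rate\n"
    "Alabama;4,853,875;1.5\n"   # Rate values are made up for illustration
    "Alaska;737,709;2.5\n"
)

# thousands="," tells the parser to ignore commas inside numbers
demo = pd.read_csv(sample, sep=";", thousands=",")
print(demo.dtypes)
```

Columns that contain other non-numeric residue, such as footnote markers, would still load as strings and need separate cleaning.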



In [20]:
crime_csv_name = 'work/csv/crime.csv'
# skipfooter is only supported by the Python parsing engine
crime_df = pd.read_csv(crime_csv_name, delimiter=';', skipfooter=11, engine='python')

print(crime_df)

                   State Population\n(total inhabitants) \n(2015) [2]  \
0                Alabama                                    4,853,875   
1                 Alaska                                      737,709   
2                Arizona                                    6,817,565   
3               Arkansas                                    2,977,853   
4             California                                   38,993,940   
5               Colorado                                    5,448,819   
6            Connecticut                                    3,584,730   
7               Delaware                                      944,076   
8   District of Columbia                                      670,377   
9                Florida                                   20,244,914   
10               Georgia                                   10,199,398   
11                Hawaii                                    1,425,157   
12                 Idaho                           



In [26]:
print(crime_df.index)
print(crime_df.columns)

crime_df = set_state_as_index(crime_df)

Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'District_of_Columbia', 'Florida', 'Georgia',
       'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky',
       'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New_Hampshire', 'New_Jersey', 'New_Mexico', 'New_York',
       'North_Carolina', 'North_Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode_Island', 'South_Carolina', 'South_Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West_Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='State')
Index(['Population\n(total inhabitants) \n(2015) [2]',
       'Murders and\nNonnegligent\nManslaughter\n(total deaths) \n(2015) [1]',
       'Murders\n(total deaths) \n(2015) [3]',
       'Gun Murders\n(total deaths) \n(2015) [3]',
       'Gun\nOw

KeyError: 'State'

In [28]:
print(crime_df)
print(crime_df.index)
print(crime_df.columns)

                     Population\n(total inhabitants) \n(2015) [2]  \
State                                                               
Alabama                                                 4,853,875   
Alaska                                                    737,709   
Arizona                                                 6,817,565   
Arkansas                                                2,977,853   
California                                             38,993,940   
Colorado                                                5,448,819   
Connecticut                                             3,584,730   
Delaware                                                  944,076   
District_of_Columbia                                      670,377   
Florida                                                20,244,914   
Georgia                                                10,199,398   
Hawaii                                                  1,425,157   
Idaho                             

In [37]:
# create a boolean dataframe that is the result of applying the function check_digit_or_dot to all the elements of crime_df
bool_crime_df = crime_df.applymap(check_digit_or_dot)

# inspect the rows of the boolean dataframe to see if there are any False values
print("Boolean DataFrame:")
print(bool_crime_df)

# Print the rows of df where there is at least one value that's not a number
print("\nRows containing at least one value that's not a number:")
print(crime_df[~bool_crime_df.all(axis=1)])

# Keep only the rows in which every value passed the numeric check
crime_df = crime_df[bool_crime_df.all(axis=1)]

Boolean DataFrame:
                      Population\n(total inhabitants) \n(2015) [2]  \
State                                                                
Alabama                                                       True   
Alaska                                                        True   
Arizona                                                       True   
Arkansas                                                      True   
California                                                    True   
Colorado                                                      True   
Connecticut                                                   True   
Delaware                                                      True   
District_of_Columbia                                          True   
Florida                                                       True   
Georgia                                                       True   
Hawaii                                                        True   
I

In [47]:
# Remove the thousands separators so the strings can be cast to numbers
crime_df = crime_df.replace(',', '', regex=True)

for col in crime_df.columns:
    # Convert to float first, since int conversion can't handle decimal strings directly
    crime_df[col] = crime_df[col].astype('float64')

    # If the float version of the column is the same as its integer version, convert to int64
    if (crime_df[col] == crime_df[col].astype('int64')).all():
        crime_df[col] = crime_df[col].astype('int64')


print(crime_df)
print(crime_df.index)
print(crime_df.columns)
print(crime_df.dtypes)


                      Population\n(total inhabitants) \n(2015) [2]  \
State                                                                
Alaska                                                      737709   
Arizona                                                    6817565   
Arkansas                                                   2977853   
California                                                38993940   
Colorado                                                   5448819   
Connecticut                                                3584730   
Delaware                                                    944076   
District_of_Columbia                                        670377   
Georgia                                                   10199398   
Hawaii                                                     1425157   
Idaho                                                      1652828   
Indiana                                                    6612768   
Iowa                

### Area file: "area.csv"

The file contains the following columns:    

```
Column    |  Description
----------| --------------------------------
State     |  name of the state                              
TotalRank |  total area rank  
TotalSqMi |  total area in SqMi
TotalKmQ  |  total area in KmQ
LandRank  |  land area rank
LandSqMi  |  land area in SqMi 
LandKmQ   |  land area in KmQ
LandPer   |  land area percentage 
WaterRank |  water area rank
WaterSqMi |  water area in SqMi
WaterKmQ  |  water area in KmQ
WaterPer  |  water area percentage
```

Follow the same procedure as with the other files.  
Do not load the rank columns.
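
One way to avoid loading the rank columns at all is to pass a callable to `usecols`, which pandas evaluates against each column name. The sketch below uses a hypothetical one-row sample with a reduced set of columns; the rank values are placeholders.

```python
import io
import pandas as pd

# Hypothetical sample following the naming scheme of area.csv (reduced column set)
sample = io.StringIO(
    "State;TotalRank;TotalSqMi;LandRank;LandSqMi\n"
    "Alaska;1;665384.04;1;570640.95\n"
)

# Keep every column whose name does not end in "Rank"
demo = pd.read_csv(sample, sep=";", usecols=lambda name: not name.endswith("Rank"))
print(demo.columns.tolist())
```

This relies on the naming convention that all rank columns end in "Rank", which holds for the column list given above.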

In [55]:
area_csv_name = 'work/csv/area.csv'
area_df = pd.read_csv(area_csv_name, delimiter=';')

# Drop the columns corresponding to ranks
area_df.drop(['TotalRank', 'LandRank', 'WaterRank'], axis=1, inplace=True)

In [58]:
# Inspect the columns and the data before setting State as the index
print(area_df.columns)
print(area_df)

area_df = set_state_as_index(area_df)

Index(['State', 'TotalSqMi', 'TotalKmQ', 'LandSqMi', 'LandKmQ', 'LandPer',
       'WaterSqMi', 'WaterKmQ', 'WaterPer'],
      dtype='object')
             State  TotalSqMi  TotalKmQ   LandSqMi  LandKmQ  LandPer  \
0           Alaska  665384.04   1723337  570640.95  1477953    85.76   
1            Texas  268596.46    695662  261231.71   676587    97.26   
2       California  163694.74    423967  155779.22   403466    95.16   
3          Montana  147039.71    380831  145545.80   376962    98.98   
4       New Mexico  121590.30    314917  121298.15   314161    99.76   
5          Arizona  113990.30    295234  113594.08   294207    99.65   
6           Nevada  110571.82    286380  109781.18   284332    99.28   
7         Colorado  104093.67    269601  103641.89   268431    99.57   
8           Oregon   98378.54    254799   95988.01   248608    97.57   
9          Wyoming   97813.01    253335   97093.14   251470    99.26   
10        Michigan   96713.51    250487   56538.90   146435    58.

In [62]:
print(area_df.index)
print(area_df.columns)
print(area_df.dtypes)

print(area_df)

Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New_Hampshire', 'New_Jersey', 'New_Mexico', 'New_York',
       'North_Carolina', 'North_Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode_Island', 'South_Carolina', 'South_Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West_Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='State')
Index(['TotalSqMi', 'TotalKmQ', 'LandSqMi', 'LandKmQ', 'LandPer', 'WaterSqMi',
       'WaterKmQ', 'WaterPer'],
      dtype='object')
TotalSqMi    float64
TotalKmQ       int64
LandSqMi     float64
LandKmQ        int64
LandPer      float64
WaterSqMi    float64
WaterKmQ    

### Income file: "income.xls"

The income data comes in an Excel file, 'income.xls'. It contains the following columns:  

```
Column                                  |  Description    
--------------------------------------- | --------------------------------   
Rank                                    |  Rank for income in 2017   
State                                   |  name of the State    
Income2017                              |  median household  income in 2017   
Income2016                              |  median household  income in 2016  
...                                     |  ...  
Income2007                              |  median household  income in 2007
```

Follow the same procedure as with the other files  
Since this is an Excel file you'll need the Pandas function [read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)  
Do not load the rank column

In [68]:
income_xls_name = 'work/csv/income.xls'
income_df = pd.read_excel(income_xls_name, header=1)

income_df = set_state_as_index(income_df)
print(income_df.head())

No duplicate values found in the State column.
            Rank  Income2017  Income2016  Income2015  Income2014  Income2013  \
State                                                                          
Alabama       46       48123       46257       44765       42830       42849   
Alaska         8       73181       76440       73355       71583       72237   
Arizona       29       56581       53558       51492       50068       48510   
Arkansas      49       45869       45907       42798       44922       39376   
California     9       71805       67739       64500       61933       60190   

            Income2012  Income2011  Income2010  Income2009  Income2008  \
State                                                                    
Alabama          41574       41415       40474       40489       42666   
Alaska           67712       67825       64576       66953       68460   
Arizona          47826       46709       46789       48745       50958   
Arkansas         39018

In [71]:
print(income_df.index)
print(income_df.columns)
print(income_df.dtypes)

Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'District_of_Columbia', 'Florida', 'Georgia',
       'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky',
       'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New_Hampshire', 'New_Jersey', 'New_Mexico', 'New_York',
       'North_Carolina', 'North_Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode_Island', 'South_Carolina', 'South_Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West_Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='State')
Index(['Rank', 'Income2017', 'Income2016', 'Income2015', 'Income2014',
       'Income2013', 'Income2012', 'Income2011', 'Income2010', 'Income2009',
       'Income2008', 'Income2007'],
      dtype='object')
Rank          int64
Income2017    int64
I
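
The output above still contains the `Rank` column, although the instructions ask for it to be left out. A minimal sketch of one way to remove it after loading, using a small hypothetical stand-in (`income_sample`) built from a few values of the printed output rather than the real file:

```python
import pandas as pd

# Hypothetical miniature of the income table; values copied from the output above.
income_sample = pd.DataFrame(
    {
        "Rank": [46, 8],
        "Income2017": [48123, 73181],
        "Income2016": [46257, 76440],
    },
    index=pd.Index(["Alabama", "Alaska"], name="State"),
)

# errors="ignore" keeps the call safe if Rank was already excluded at load time.
income_sample = income_sample.drop(columns=["Rank"], errors="ignore")

print(income_sample.columns.tolist())
```

Alternatively, `read_excel` accepts a callable for `usecols`, so something like `usecols=lambda name: name != 'Rank'` should skip the column while loading, assuming the header row is parsed as shown above.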

### Region file: "region.txt"
The file 'region.txt' contains the following columns:  

```
Column     |  Description
---------- | --------------------------------
Name       |  name of the state
Abb        |  abbreviation of the name of the state
Region     |  the region that each state belongs to (Northeast, South, North Central, West)
Division   |  state division (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, and Pacific)
```

Follow the same procedure as with the other files.  
You can load it with the Pandas function [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).   
Check which column separator was used.   
Pay attention to the Division column: it appears to contain some inconsistencies. 
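
A minimal sketch of how such inconsistencies can be spotted and normalised. The tab separator, the in-memory sample, and the capitalisation variant `New england` are assumptions made for illustration; the real file may use a different separator and contain different problems.

```python
import io

import pandas as pd

# Hypothetical excerpt: the real separator and the real inconsistencies may differ.
sample = io.StringIO(
    "Name\tAbb\tRegion\tDivision\n"
    "Alabama\tAL\tSouth\tEast South Central\n"
    "Alaska\tAK\tWest\tPacific\n"
    "Maine\tME\tNortheast\tNew England\n"
    "Vermont\tVT\tNortheast\tNew england\n"
)

region_sample = pd.read_csv(sample, sep="\t")

# Listing the distinct labels makes spelling or capitalisation variants visible.
print(region_sample["Division"].value_counts())

# Normalise whitespace and capitalisation. str.title() suits these labels only
# because every word in the division names starts with a capital letter.
region_sample["Division"] = region_sample["Division"].str.strip().str.title()

print(sorted(region_sample["Division"].unique()))
```

If the variants turn out to be misspellings rather than capitalisation differences, an explicit mapping passed to `Series.replace` is probably the safer choice.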

### Data Collection Report

We loaded four CSV data files, an Excel file, and a text file. The first five data files were acquired from the following internet sources (Wikipedia):   
* education.csv: [List of U.S. states and territories by educational attainment](https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_educational_attainment)
* crime.csv: [Gun violence in the United States by state](https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state)
* area.csv: [List of U.S. states and territories by area](https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area)
* life_expectancy.csv: [List of U.S. states and territories by life expectancy](https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy)

The income file was provided in Excel format:  
* income.xls: [Household income in the United States](https://en.wikipedia.org/wiki/Household_income_in_the_United_States)

The region text data file was obtained from the R package 'datasets' (state.x77):  
* region.txt 

#### Problems encountered:
** TO DO **
List the problems you encountered and the solutions you applied.


## 2. Data Description

In this part, you'll examine the "surface" properties of the data. 
Check the number of rows of each dataframe.
You can use the DataFrame property [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html#pandas.DataFrame.shape).
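
As a rough illustration, the check can be run over all dataframes in one loop. The dictionary `frames` and its two tiny frames are hypothetical stand-ins for the dataframes loaded earlier; the values are taken from the outputs shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the loaded DataFrames.
frames = {
    "edu": pd.DataFrame(
        {"HSGradPer": [93.0, 92.8]},
        index=pd.Index(["Montana", "New Hampshire"], name="State"),
    ),
    "income": pd.DataFrame(
        {"Income2017": [48123, 73181, 56581]},
        index=pd.Index(["Alabama", "Alaska", "Arizona"], name="State"),
    ),
}

# shape is a (rows, columns) tuple; the index is not counted as a column.
for name, frame in frames.items():
    n_rows, n_cols = frame.shape
    print(f"{name}: {n_rows} rows, {n_cols} columns")
```

Because State is the index, the column count reported by `shape` excludes it, which is worth keeping in mind when filling in the tables of the report.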

### Data Description Report

We acquired the following dataframes. All of them use the State column as their index.

###### edu
** TO DO **
number of rows

```
Column    |  Type   | Description
----------|---------|-------------------------
State     |  object | name of the state         
HSGradPer | float64 | % high school graduate or higher  
BADegPer  | float64 | % bachelor degree or higher  
AdvDegPer | float64 | % advanced degree or higher  
```

###### life_exp
** TO DO **
number of rows 

```
Column      |  Type   | Description
------------|---------|-------------------------
State       |  object | name of the state         
LifeExp2018 | float64 | life expectancy (2018)   
LifeExp2010 | float64 | life expectancy (2010)   
MaleLifeExp | float64 | male life expectancy  
FemLifeExp  | float64 | female life expectancy
```

###### crime
** TO DO **
number of rows 

```
Column       |  Type   | Description
-------------|---------|-------------------------
State        |  object | name of the state           
PopTot       |   int64 | population (total inhabitants) (2015) 
MurderNMTot  |   int64 | murders and non-negligent manslaughter (total deaths) (2015)
MurderTot    | float64 | murders (total deaths) (2015) 
GunMurderTot | float64 | gun murders (total deaths) (2015)
GunOwnerPer  | float64 | gun ownership (%) (2013) 
MurderNMRate | float64 | murders and non-negligent manslaughter rate (per 100,000) (2015) 
MurderRate   | float64 | murder rate (per 100,000)
GunMurderRate| float64 | gun murder rate (per 100,000)
```

###### area
** TO DO **
number of rows

```
Column    |  Type   | Description
----------|---------|-------------------------
State     |  object | name of the state           
TotalSqMi | float64 | total area (square miles)
TotalKmQ  |   int64 | total area (square kilometers)
LandSqMi  | float64 | land area (square miles)
LandKmQ   |   int64 | land area (square kilometers)
LandPer   | float64 | land area percentage
WaterSqMi | float64 | water area (square miles)
WaterKmQ  |   int64 | water area (square kilometers)
WaterPer  | float64 | water area percentage
```

###### income
** TO DO **
number of rows

```
Column     |  Type   | Description
-----------|---------|-------------------------
State      |  object | name of the state           
Income2017 |   int64 | median household income in 2017
Income2016 |   int64 | median household income in 2016
Income2015 |   int64 | median household income in 2015
Income2014 |   int64 | median household income in 2014
Income2013 |   int64 | median household income in 2013
Income2012 |   int64 | median household income in 2012
Income2011 |   int64 | median household income in 2011
Income2010 |   int64 | median household income in 2010
Income2009 |   int64 | median household income in 2009
Income2008 |   int64 | median household income in 2008
Income2007 |   int64 | median household income in 2007
```

###### region
** TO DO **
number of rows

```
Column     |  Type   | Description
-----------|---------|-------------------------
State      |  object | name of the state    
Abb        |  object | abbreviation of the name of the state
Region     |  object | the region that each state belongs to (Northeast, South, North Central, West)          
Division   |  object | state divisions (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, and Pacific)           
```




## 3. Data Quality

In this part you’ll examine if the data is complete. Check if you have all the cases you need (in this case, all the US states) and if there are missing values.

In the U.S. there are 50 states, the federal district 'District of Columbia' and 5 inhabited territories: 'Puerto Rico', 'American Samoa', 'Guam', 'Northern Mariana Islands', and 'U.S. Virgin Islands'.

Check if all the files you loaded contain the fifty states and if there are differences, investigate where these differences come from. You can use the index of the DataFrame region as a guide.

Then check if there are missing values.

In [39]:
# define a variable state_names with the states in the index of region

In [40]:
# for each of the other DataFrames, check if they contain a state that's not in state_names
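
A minimal sketch of the comparison described in the two cells above, using `Index.difference`. The frames `region_sample` and `other_sample` are hypothetical and carry only an index; the real dataframes may also differ in how state names are spelled (for example underscores versus spaces).

```python
import pandas as pd

# Hypothetical stand-ins that only carry a State index.
region_sample = pd.DataFrame(
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="State")
)
other_sample = pd.DataFrame(
    index=pd.Index(["Alabama", "Alaska", "District_of_Columbia"], name="State")
)

state_names = region_sample.index

# Entries present in the other frame but not among the reference states.
extra = other_sample.index.difference(state_names)

# Reference states that the other frame lacks.
missing = state_names.difference(other_sample.index)

print("extra:", list(extra))
print("missing:", list(missing))
```

Running both directions of the difference helps separate genuinely extra rows (territories, the federal district, totals) from states that are missing or spelled differently.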


Check if the dataframes have missing values.   
You can use the DataFrame method [isnull](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html?highlight=isnull#pandas.DataFrame.isnull).
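
One possible way to summarise missing values, sketched on a small hypothetical frame. The column names follow the crime table, but the values and the state labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the values below are invented, not taken from crime.csv.
sample = pd.DataFrame(
    {
        "MurderTot": [1.0, np.nan, 3.0],
        "GunMurderTot": [np.nan, np.nan, 2.0],
    },
    index=pd.Index(["A", "B", "C"], name="State"),
)

# isnull() returns a boolean frame; summing it counts missing values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows with at least one missing value, to see which states are affected.
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```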



### Data Quality Report

In the U.S. there are fifty states, the federal district 'District of Columbia' and five inhabited territories: 'Puerto Rico', 'American Samoa', 'Guam', 'Northern Mariana Islands', and 'U.S. Virgin Islands'.

** TO DO **       
Explain your findings

## 4. Save the data

Save the dataframes for the next milestone.
You can use the DataFrame method [to_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv).
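
A minimal sketch of a save-and-reload round trip. The file name `income_clean.csv` and the temporary directory are assumptions for illustration; in the notebook you would pick a path that the next milestone can read.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature of the income table; values copied from the output shown earlier.
income_sample = pd.DataFrame(
    {"Income2017": [48123, 73181]},
    index=pd.Index(["Alabama", "Alaska"], name="State"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "income_clean.csv"  # hypothetical file name

    # to_csv writes the index by default, so the State labels are preserved.
    income_sample.to_csv(out_path)

    # index_col restores State as the index when the file is read back.
    reloaded = pd.read_csv(out_path, index_col="State")

print(reloaded)
```

Reading the file back once is a cheap way to confirm that the index and the column types survive the round trip.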