# Knowledge Check 2

## To-Do List

- [x] Make a .ipynb that contains the following:
    - [x] Find and access a data set in any way.
        - [x] Fix character strings that aren't formatted correctly.
        - [x] Correct column names if they're misnamed.
- [x] Commit your changes.
- [x] Push your changes to GitHub.
    - [x] Turn in the GitHub link on Google Classroom.

## Outline
- Title
    - To-Do List
    - Outline
    - Libraries
    - Access Data Set
        - Setting Variable for API Call
        - Making API Call
    - Column list
    - Column Order
    - Column Types
    - Check for Missing Data


## Libraries

In [13]:
import requests
import pandas as pd
import numpy as np

## Access a Data Set

An API key is required. To run the code, sign up for your own key at the following address:

https://api.census.gov/data/key_signup.html

In [3]:
key = 'YOUR KEY HERE'

### Setting Variable for API Call

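The variables below hold the fields to request (place name and total population) and the state and county selectors. The API returns the state and county FIPS codes alongside the requested fields.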

In [4]:
# Fields to request: place name and total population (PCT12_001N)
# 'get' statement for requests parameters
total_all = 'NAME,PCT12_001N'
# Destination API
url = 'https://api.census.gov/data/2020/dec/dhc'
# 'in' statement for requests parameters
how = 'state:*'
# 'for' statement for requests parameters
where = 'county:*'
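
As a rough illustration of how these variables become a request, the sketch below assembles the query string with the standard library only, so it runs without a key or a network connection. It is not how requests builds the URL internally (requests may percent-encode some characters differently), and the key is left out on purpose.

```python
from urllib.parse import urlencode, parse_qs

# Same values as the cell above; the API key is omitted on purpose
url = 'https://api.census.gov/data/2020/dec/dhc'
params = {
    'get': 'NAME,PCT12_001N',  # fields to return
    'for': 'county:*',         # geography level wanted (every county)
    'in': 'state:*',           # parent geography (every state)
}

# urlencode percent-encodes ',', ':' and '*' in the values
query = urlencode(params)
full_url = f"{url}?{query}"
print(full_url)

# Decoding the query string recovers the original values
decoded = {name: values[0] for name, values in parse_qs(query).items()}
print(decoded)
```

Printing the assembled URL before making the call is a cheap way to check that 'for' and 'in' carry the intended geography.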

### Making API Call

Uses requests.request to call the API with the variables set in the previous cell.

The response is saved to a pandas DataFrame; the shape and the DataFrame are printed for verification's sake.


In [5]:
# Making call to API
r = requests.request('GET', url, params={"get": total_all,
                                         "for": where,
                                         "in": how,
                                         "key": key
                                         })

# IF/ELSE to account for failed call
if r.status_code == 200:
    data = r.json()
    
    # The first row holds the column names (used below); the remaining rows are the data
    df_data = data[1:]

    # Build Pandas dataframe from json data starting with row we want
    DF_total_all = pd.DataFrame(df_data, columns=data[0])

    # Join 'state' and 'county' columns into a new column 'FIPS'
    DF_total_all['FIPS'] = DF_total_all['state'] + DF_total_all['county']

    # Display the created data frame
    print("'DF_total_all' shape:")
    print(DF_total_all.shape)
    print("Data frame 'DF_total_all':")
    print(DF_total_all)

else:
    # API call failed
    print(f"API call for '{total_all}' failed. Status code:", r.status_code)

'DF_total_all' shape:
(3221, 5)
Data frame 'DF_total_all':
                             NAME PCT12_001N state county   FIPS
0        Bullitt County, Kentucky      82217    21    029  21029
1         Butler County, Kentucky      12371    21    031  21031
2       Caldwell County, Kentucky      12649    21    033  21033
3       Calloway County, Kentucky      37103    21    035  21035
4       Magoffin County, Kentucky      11637    21    153  21153
...                           ...        ...   ...    ...    ...
3216    Haywood County, Tennessee      17864    47    075  47075
3217  Henderson County, Tennessee      27842    47    077  47077
3218         Howard County, Texas      34860    48    227  48227
3219       Hudspeth County, Texas       3202    48    229  48229
3220           Hunt County, Texas      99956    48    231  48231

[3221 rows x 5 columns]
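
The header-then-rows handling in the cell above can be tried offline. This is a minimal sketch that assumes the response is a list of lists with the header first, as the cell's code implies; the two data rows are copied from the output above, and `sample` and `df_sample` are names made up for the illustration.

```python
import pandas as pd

# Two rows copied from the output above, in the list-of-lists shape
# that r.json() returns: header row first, then one list per county
sample = [
    ['NAME', 'PCT12_001N', 'state', 'county'],
    ['Bullitt County, Kentucky', '82217', '21', '029'],
    ['Hunt County, Texas', '99956', '48', '231'],
]

# Split the header from the data rows
header, *rows = sample
df_sample = pd.DataFrame(rows, columns=header)

# Every value arrives as a string, so '+' concatenates rather than adds
df_sample['FIPS'] = df_sample['state'] + df_sample['county']
print(df_sample)
```

Because the codes are strings, the leading zero in a county code such as 029 survives the join.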


## Column list

Show the current column names, replace them with the names from a list, then show the new column names.

In [6]:
col_names = ['Location','Total','State','County','FIPS']

# Print the columns
print("Columns of DF_total_all:")
for column in DF_total_all.columns:
    print(column)

# Assign the new column names to the data frame in place
DF_total_all.columns = col_names

# Print the columns again to confirm the new names
print("Columns of DF_total_all:")
for column in DF_total_all.columns:
    print(column)


Columns of DF_total_all:
NAME
PCT12_001N
state
county
FIPS
Columns of DF_total_all:
Location
Total
State
County
FIPS
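
Assigning a list to .columns depends on the columns arriving in a known order. As an alternative sketch (not what the notebook does), rename with an explicit old-to-new mapping does not depend on position. The one-row frame and the `new_names` mapping are made up for the illustration.

```python
import pandas as pd

# One row copied from the notebook output, with the original column names
df_sample = pd.DataFrame(
    [['Bullitt County, Kentucky', '82217', '21', '029', '21029']],
    columns=['NAME', 'PCT12_001N', 'state', 'county', 'FIPS'],
)

# Hypothetical mapping: old name -> new name ('FIPS' is left unchanged)
new_names = {
    'NAME': 'Location',
    'PCT12_001N': 'Total',
    'state': 'State',
    'county': 'County',
}

# errors='raise' makes rename fail loudly if an expected column is absent
df_sample = df_sample.rename(columns=new_names, errors='raise')
print(df_sample.columns.tolist())
```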


## Column Order

Swap the first two columns so the wanted value (Total) comes first and the location columns stay grouped together.

In [7]:
columns = DF_total_all.columns.tolist()
columns[0], columns[1] = columns[1], columns[0]
DF_total_all = DF_total_all.reindex(columns=columns)
print(columns)

['Total', 'Location', 'State', 'County', 'FIPS']
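
Another way to reorder, sketched here under the assumption that the final order is known up front, is to select the columns by name. Unlike reindex, which fills a misspelled column name with NaN, selecting by name raises a KeyError. `wanted_order` is a name made up for the illustration.

```python
import pandas as pd

# One row copied from the notebook output, already renamed
df_sample = pd.DataFrame(
    [['Bullitt County, Kentucky', '82217', '21', '029', '21029']],
    columns=['Location', 'Total', 'State', 'County', 'FIPS'],
)

# Spell out the final order explicitly
wanted_order = ['Total', 'Location', 'State', 'County', 'FIPS']
df_sample = df_sample[wanted_order]
print(df_sample.columns.tolist())
```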


## Column Types

Checking datatypes, changing types, and checking again.

In [9]:
DF_total_all.dtypes

Total       object
Location    object
State       object
County      object
FIPS        object
dtype: object

In [10]:
DF_total_all['Total'] = DF_total_all['Total'].astype(int)

In [12]:
print(DF_total_all.dtypes)

Total        int32
Location    object
State       object
County      object
FIPS        object
dtype: object
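
astype(int) raises an error if any value is not a clean integer string. As a hedged alternative, pd.to_numeric with errors='coerce' turns such values into NaN so they can be inspected afterwards. The 'N/A' entry below is invented for the illustration and does not appear in the Census response.

```python
import pandas as pd

# 'N/A' is a made-up bad value for illustration; it is not in the Census data
totals = pd.Series(['82217', '12371', 'N/A'])

# errors='coerce' turns unparseable strings into NaN instead of raising
converted = pd.to_numeric(totals, errors='coerce')
print(converted)
print("Values that failed to convert:", converted.isna().sum())

# State, County and FIPS are better left as strings:
# converting to int would drop the leading zero
print(int('029'))
```

The int32 shown above comes from the platform's default integer width; passing 'int64' to astype instead of int should make the width explicit where that matters.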


## Check for Missing Data

In [16]:
# Find the rows with missing data
rows_with_missing_data = DF_total_all[DF_total_all.isnull().any(axis=1)]

# Find the columns with missing data
columns_with_missing_data = DF_total_all.columns[DF_total_all.isnull().any()]

# Check if there is missing data in the dataframe
if not rows_with_missing_data.empty or len(columns_with_missing_data) > 0:
    print("Missing data found in DataFrame 'DF_total_all'")

    # Replace NaN values with string 'NaN'
    DF_total_all = DF_total_all.replace({np.nan: 'NaN'})

    # Print the rows with missing data
    print("Rows with missing data:")
    print(rows_with_missing_data)

    # Print the columns with missing data
    print("Columns with missing data:")
    print(columns_with_missing_data)
    print()

    # Search for 'NaN' string in the dataframe
    rows_with_nan_string = DF_total_all[DF_total_all.eq('NaN').any(axis=1)]
    columns_with_nan_string = DF_total_all.columns[DF_total_all.eq('NaN').any()]

    # Print the rows with 'NaN' string
    print("Rows with 'NaN' string:")
    print(rows_with_nan_string)

    # Print the columns with 'NaN' string
    print("Columns with 'NaN' string:")
    print(columns_with_nan_string)
    print()
else:
    print("No missing data found in DataFrame 'DF_total_all'")

No missing data found in DataFrame 'DF_total_all'
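
For a quicker summary than printing whole rows, the per-column count of missing values can be sketched as below. The frame is made up for the illustration: the NaN and the empty string are injected and are not part of the Census data. isna() does not flag empty strings, so they are counted separately.

```python
import numpy as np
import pandas as pd

# Made-up frame: the NaN and the empty string are injected for illustration
df_sample = pd.DataFrame({
    'Total': [82217, np.nan],
    'Location': ['Bullitt County, Kentucky', ''],
})

# Count of NaN/None values in each column
missing_per_column = df_sample.isna().sum()
print(missing_per_column)

# Empty strings are not NaN, so count them separately
blank_per_column = df_sample.eq('').sum()
print(blank_per_column)
```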
