## Data Integration and Analysis

In this lab, I will focus on integrating multiple datasets and performing comprehensive data analysis. The goal is to explore techniques for combining, transforming, and analyzing data from various sources to uncover meaningful insights. Key steps in this process include:

1. **Data Integration**: Combining multiple datasets into a single, coherent structure.
2. **Data Cleaning**: Handling missing or inconsistent data to ensure accuracy in the analysis.
3. **Exploratory Data Analysis**: Using visualizations and summary statistics to reveal patterns and trends across the integrated dataset.

This lab emphasizes the importance of working with diverse data sources and the challenges of integrating them effectively for analysis. The data I use comes from the City of Seattle. It lists the police beats in the Seattle area and gives the latitude and longitude of each one.

In [2]:
import pandas as pd
beats_data = pd.read_csv('Police_Beat_and_Precinct_Centerpoints.csv')

In [3]:
beats_data.head()

| | Name | Location 1 | Latitude | Longitude |
|---|---|---|---|---|
| 0 | B1 | (47.7097756394592, -122.370990523069) | 47.70978 | -122.37099 |
| 1 | B2 | (47.6790521901374, -122.391748391741) | 47.67905 | -122.39175 |
| 2 | B3 | (47.6812920482227, -122.364236159741) | 47.68129 | -122.36424 |
| 3 | C1 | (47.6342500180223, -122.315684762418) | 47.63425 | -122.31568 |
| 4 | C2 | (47.6192385752996, -122.313557430551) | 47.61924 | -122.31356 |


### Inspection

In [5]:
beats_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        57 non-null     object 
 1   Location 1  57 non-null     object 
 2   Latitude    57 non-null     float64
 3   Longitude   57 non-null     float64
dtypes: float64(2), object(2)
memory usage: 1.9+ KB


In [6]:
missing_values = beats_data.isnull().sum()
missing_values

Name          0
Location 1    0
Latitude      0
Longitude     0
dtype: int64

In [7]:
''' There are 57 records and 4 variables, with no missing data '''

' There are 57 records and 4 variables, with no missing data '
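
Beyond missing values, one further check is whether the `Location 1` string agrees with the separate `Latitude` and `Longitude` columns. The sketch below is a minimal, standard-library version of that check. The helper names `parse_location` and `location_matches` and the tolerance of 1e-5 degrees are assumptions chosen for illustration, not part of the original notebook.

```python
import ast


def parse_location(text):
    """Parse a string such as "(47.70978, -122.37099)" into (lat, lon) floats."""
    lat, lon = ast.literal_eval(text)
    return float(lat), float(lon)


def location_matches(text, latitude, longitude, tol=1e-5):
    """Return True if the parsed pair is within `tol` degrees of the columns."""
    lat, lon = parse_location(text)
    return abs(lat - latitude) <= tol and abs(lon - longitude) <= tol


# First two rows shown in the head() output earlier in this lab.
rows = [
    ("B1", "(47.7097756394592, -122.370990523069)", 47.70978, -122.37099),
    ("B2", "(47.6790521901374, -122.391748391741)", 47.67905, -122.39175),
]
mismatches = [name for name, loc, lat, lon in rows
              if not location_matches(loc, lat, lon)]
print(mismatches)  # → []
```

On the full DataFrame, the same helper could be applied row by row, for example with `beats_data.apply(lambda r: location_matches(r['Location 1'], r['Latitude'], r['Longitude']), axis=1)`.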

### Using an API for Census Data Integration

In this section, I use the **censusgeocode** package to retrieve census tract information from latitude and longitude coordinates. The `get_census_tract()` function returns the census tract **GEOID** for a given location, which makes it possible to integrate census data with the beats dataset.

#### Purpose:
- Retrieve census tract data using geographic coordinates.
- Map locations to their respective **GEOID** for further analysis.

In [10]:
import censusgeocode as cg

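def get_census_tract(lon, lat):
    """Return the census tract GEOID for a longitude/latitude pair, or None."""
    try:
        # The geocoder takes longitude as x and latitude as y
        result = cg.coordinates(x=lon, y=lat)
        # Use the first tract when the response contains any
        if result and 'Census Tracts' in result and result['Census Tracts']:
            return result['Census Tracts'][0]['GEOID']
        else:
            return None
    except Exception as e:
        # Report the failure and return None so a batch run can continue
        print(f"Error retrieving census tract: {e}")
        return None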

# Example usage
longitude = -122.37099
latitude = 47.70978
print(get_census_tract(longitude, latitude))

53033001400


In [11]:
get_census_tract(-118.321495, 34.134117) #should return '06037980009'

'06037980009'
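
Each call to `get_census_tract()` is a network request, so repeated lookups of the same coordinates are wasted round trips. One option is to memoize the lookup. The sketch below illustrates the idea with `functools.lru_cache`. Here `fake_lookup` is a hypothetical stand-in for the real API call so that the example runs offline, and rounding the cache key to five decimals is an assumption that matches the precision of the `Latitude` and `Longitude` columns.

```python
from functools import lru_cache

calls = []  # records every time the underlying lookup is actually invoked


def fake_lookup(lon, lat):
    """Hypothetical stand-in for get_census_tract(); returns a fixed GEOID."""
    calls.append((lon, lat))
    return "53033001400"


@lru_cache(maxsize=None)
def _cached(lon, lat):
    # Only reached on a cache miss
    return fake_lookup(lon, lat)


def cached_tract(lon, lat, ndigits=5):
    """Round the coordinates so near-identical inputs share one cache entry."""
    return _cached(round(lon, ndigits), round(lat, ndigits))


first = cached_tract(-122.37099, 47.70978)
second = cached_tract(-122.370990001, 47.709780001)  # same key after rounding
print(first, second, len(calls))  # the second call is served from the cache
```

A failed lookup that returns `None` would be cached as well, so a fuller version might skip caching on failure and retry instead.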

### Getting Census Tracts for Beats Data

In this section, I retrieve census tract information for each beat in the beats dataset using the previously defined `get_census_tract()` function. Additionally, I check if each beat is located in King County (tracts starting with '53033').

- **Census Tract Retrieval**: Census tracts are assigned to beats based on their geographic coordinates.
- **King County Check**: Each tract GEOID is checked for the prefix '53033' (Washington state FIPS code 53 followed by King County FIPS code 033).
- **Output**: The updated dataset is saved for future use.

This process helps in associating geographic data with specific beats for further analysis.
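
The prefix test works because an 11-digit tract GEOID concatenates a 2-digit state FIPS code, a 3-digit county FIPS code, and a 6-digit tract code. As a minimal sketch, the helpers below split a GEOID into those parts and check the county. The names `split_geoid` and `in_county` are illustrative, and the zero-padding step is an assumption that guards against GEOIDs read back from a CSV as integers, which lose any leading zero.

```python
def split_geoid(geoid):
    """Split a tract GEOID into (state, county, tract) FIPS strings."""
    text = str(geoid).zfill(11)  # restore a leading zero lost to int parsing
    if len(text) != 11 or not text.isdigit():
        raise ValueError(f"not an 11-digit tract GEOID: {geoid!r}")
    return text[:2], text[2:5], text[5:]


def in_county(geoid, state="53", county="033"):
    """Return True if the GEOID lies in the given state and county."""
    if geoid is None:  # the lookup failed, so treat it as outside the county
        return False
    return split_geoid(geoid)[:2] == (state, county)


print(split_geoid("53033001400"))  # → ('53', '033', '001400')
print(in_county("53033001400"))    # → True
print(in_county(6037980009))       # → False (pads to '06037980009')
print(in_county(None))             # → False
```

Comparing the parsed state and county parts rather than a raw string prefix keeps the check correct even when the GEOID arrives as an integer.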

In [26]:
beats_data['CensusTract'] = beats_data.apply(lambda row: get_census_tract(row['Longitude'], row['Latitude']), axis=1)

In [14]:
beats_data['InKingCounty'] = beats_data['CensusTract'].apply(lambda x: x.startswith('53033') if x else False)

In [15]:
not_in_king_county = beats_data[~beats_data['InKingCounty']]
if not not_in_king_county.empty:
    print("Tracts not in King County:")
    print(not_in_king_county)

In [16]:
# Save the updated dataset
output_file_path = '/Users/yourusername/Desktop/DevKit/Updated_Police_Beat_and_Precinct_Centerpoints.csv' 