# Introduction to AI - HW 1
Name: **Fan Yang**  
Andrew id: **fy4**  
Last Modified Date: 09/09/2025

## Project Overview
The University of Wisconsin Population Health Institute released the "County Health Rankings," ranking all 3,142 counties in the United States based on health outcomes and behaviors. In this assignment, I will analyze the 2025 data using clustering and classification techniques to identify clusters of counties with similar health outcomes and behaviors, and build two predictive models to understand which factors influence health outcomes.

This assignment will cover the following:

1. Data Source and Preparation
2. Feature Selection
3. Exploratory Data Analysis (EDA)
4. Clustering
5. Two Supervised Learning Models
6. Recommendations

## Part 1: Data Source and Preparation

In this section, I will import all the libraries I need, and then import the data file `analytic_data2025_v2.csv`.

In [5]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# path to the data file
data_path = 'analytic_data2025_v2.csv'
# load the data; header=0 uses the first row (descriptive names) as the header,
# so the second row (short variable codes such as v001_rawvalue) stays in as data row 0
df = pd.read_csv(data_path, header=0)
df.head() # display the first few rows of the dataframe



Unnamed: 0,State FIPS Code,County FIPS Code,5-digit FIPS Code,State Abbreviation,Name,Release Year,County Clustered (Yes=1/No=0),Premature Death raw value,Premature Death numerator,Premature Death denominator,...,% Rural raw value,% Rural numerator,% Rural denominator,% Rural CI low,% Rural CI high,Population raw value,Population numerator,Population denominator,Population CI low,Population CI high
0,statecode,countycode,fipscode,state,county,year,county_clustered,v001_rawvalue,v001_numerator,v001_denominator,...,v058_rawvalue,v058_numerator,v058_denominator,v058_cilow,v058_cihigh,v051_rawvalue,v051_numerator,v051_denominator,v051_cilow,v051_cihigh
1,0,0,0,US,United States,2025,,8351.736549,4763989,925367214,...,0.200031371,66300254,331449281,,,334914895,,,,
2,1,0,1000,AL,Alabama,2025,,11853.24725,102760,13958454,...,0.422627605,2123399,5024279,,,5108468,,,,
3,1,1,1001,AL,Autauga County,2025,1,9938.263382,1008,163064,...,0.406768132,23920,58805,,,60342,,,,
4,1,3,1003,AL,Baldwin County,2025,1,8957.112686,3944,653515,...,0.375864554,87113,231767,,,253507,,,,
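The preview above shows that the row of short variable codes (`statecode`, `v001_rawvalue`, ...) is loaded as data row 0, and that national and state aggregates (County FIPS Code 0) sit alongside the counties. A minimal sketch of one way to handle both is given below. It assumes the analysis should use county-level rows only, and it runs on a small in-memory sample rather than the real file; `sample_csv`, `clean`, and `counties` are illustrative names that are not part of the notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for analytic_data2025_v2.csv.
# Values are copied from the preview above; only a few columns are kept.
sample_csv = io.StringIO(
    "State FIPS Code,County FIPS Code,Name,Premature Death raw value\n"
    "statecode,countycode,county,v001_rawvalue\n"
    "0,0,United States,8351.736549\n"
    "1,0,Alabama,11853.24725\n"
    "1,1,Autauga County,9938.263382\n"
    "1,3,Baldwin County,8957.112686\n"
)

# skiprows=[1] drops the second line of the file (the short variable codes)
# while the first line is still used as the header, so numeric columns
# are parsed as numbers instead of strings.
clean = pd.read_csv(sample_csv, header=0, skiprows=[1])

# Assumption: rows with County FIPS Code == 0 are national or state
# aggregates and are not wanted in a county-level analysis.
counties = clean[clean["County FIPS Code"] != 0].reset_index(drop=True)

print(counties.dtypes)
print(counties)
```

Applying the same two steps to the real file would change the row count reported later in the notebook, so the printed outputs in this report correspond to the unfiltered load.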


## Part 2: Feature Selection

In this section, I filter the columns. Only raw-value columns (short variable codes such as `v001_rawvalue`) are used as features. Because the header row holds the descriptive names, I select them by matching the suffix `raw value` in the column name.

In [6]:
# filter columns whose descriptive name ends with 'raw value'
raw_value_cols = [col for col in df.columns if col.endswith('raw value')]
features = df[raw_value_cols]
print(features.columns)  # display the column names of the features

Index(['Premature Death raw value', 'Poor Physical Health Days raw value',
       'Low Birth Weight raw value', 'Poor Mental Health Days raw value',
       'Poor or Fair Health raw value', 'Flu Vaccinations raw value',
       'Access to Exercise Opportunities raw value',
       'Food Environment Index raw value', 'Primary Care Physicians raw value',
       'Mental Health Providers raw value', 'Dentists raw value',
       'Preventable Hospital Stays raw value',
       'Mammography Screening raw value', 'Uninsured raw value',
       'Severe Housing Problems raw value', 'Driving Alone to Work raw value',
       'Long Commute - Driving Alone raw value',
       'Air Pollution: Particulate Matter raw value',
       'Drinking Water Violations raw value', 'Broadband Access raw value',
       'Library Access raw value', 'Some College raw value',
       'High School Completion raw value', 'Unemployment raw value',
       'Income Inequality raw value', 'Children in Poverty raw value',
       'Inj

According to the 2025 Technical Documentation, the columns from the 'Population Health and Well-Being' group must be removed. They are:

['Premature Death raw value',  'Poor Physical Health Days raw value',  'Low Birth Weight raw value',   'Poor Mental Health Days raw value',    'Poor or Fair Health raw value']

The columns from the 'Community Conditions' group are kept. They are:

['Flu Vaccinations raw value','Access to Exercise Opportunities raw value','Food Environment Index raw value', 'Primary Care Physicians raw value','Mental Health Providers raw value', 'Dentists raw value', 'Preventable Hospital Stays raw value','Mammography Screening raw value', 'Uninsured raw value','Air Pollution: Particulate Matter raw value','Drinking Water Violations raw value', 'Broadband Access raw value', 'Library Access raw value',  'Severe Housing Problems raw value', 'Driving Alone to Work raw value', 'Long Commute - Driving Alone raw value', 'Some College raw value','High School Completion raw value', 'Unemployment raw value','Income Inequality raw value', 'Children in Poverty raw value', 'Social Associations raw value','Child Care Cost Burden raw value', 'Injury Deaths raw value']

In my view, some of these columns are not relevant and can be omitted:

- 'Broadband Access raw value': I do not think it is directly related to people's health
- 'Driving Alone to Work raw value': hard to control through policy




In [7]:
# Per the 2025 Technical Documentation, columns in the
# Population Health and Well-Being group are excluded
population_health_cols_to_remove = [
    'Premature Death raw value',
    'Poor Physical Health Days raw value',
    'Low Birth Weight raw value',
    'Poor Mental Health Days raw value',
    'Poor or Fair Health raw value'
]

# Columns in the Community Conditions group are kept
community_conditions_cols = [
    'Flu Vaccinations raw value',
    'Access to Exercise Opportunities raw value',
    'Food Environment Index raw value',
    'Primary Care Physicians raw value',
    'Mental Health Providers raw value',
    'Dentists raw value',
    'Preventable Hospital Stays raw value',
    'Mammography Screening raw value',
    'Uninsured raw value',
    'Air Pollution: Particulate Matter raw value',
    'Drinking Water Violations raw value',
    'Broadband Access raw value',
    'Library Access raw value',
    'Severe Housing Problems raw value',
    'Driving Alone to Work raw value',
    'Long Commute - Driving Alone raw value',
    'Some College raw value',
    'High School Completion raw value',
    'Unemployment raw value',
    'Income Inequality raw value',
    'Children in Poverty raw value',
    'Social Associations raw value',
    'Child Care Cost Burden raw value',
    'Injury Deaths raw value'
]

# Columns I consider irrelevant (based on my own analysis)
additional_cols_to_remove = [
    'Broadband Access raw value',  # weak link to health
    'Driving Alone to Work raw value'  # hard to control through policy
]

# First drop the irrelevant columns from the Community Conditions group
community_conditions_final = [col for col in community_conditions_cols if col not in additional_cols_to_remove]

# Then select the final feature set from the raw-value features
features_filtered = features[community_conditions_final]

print(f"Original feature count: {len(features.columns)}")
print(f"Community Conditions group feature count: {len(community_conditions_cols)}")
print(f"Irrelevant features removed: {len(additional_cols_to_remove)}")
print(f"Final feature count: {len(features_filtered.columns)}")
print("\nFinal feature columns:")
for i, col in enumerate(features_filtered.columns, 1):
    print(f"{i:2d}. {col}")


Original feature count: 90
Community Conditions group feature count: 24
Irrelevant features removed: 2
Final feature count: 22

Final feature columns:
 1. Flu Vaccinations raw value
 2. Access to Exercise Opportunities raw value
 3. Food Environment Index raw value
 4. Primary Care Physicians raw value
 5. Mental Health Providers raw value
 6. Dentists raw value
 7. Preventable Hospital Stays raw value
 8. Mammography Screening raw value
 9. Uninsured raw value
10. Air Pollution: Particulate Matter raw value
11. Drinking Water Violations raw value
12. Library Access raw value
13. Severe Housing Problems raw value
14. Long Commute - Driving Alone raw value
15. Some College raw value
16. High School Completion raw value
17. Unemployment raw value
18. Income Inequality raw value
19. Children in Poverty raw value
20. Social Associations raw value
21. Child Care Cost Burden raw value
22. Injury Deaths raw value


In [8]:
# Data cleaning and validation
print("=== Data quality check ===")
print(f"Dataset shape: {features_filtered.shape}")
print(f"Rows (counties): {features_filtered.shape[0]}")
print(f"Columns (features): {features_filtered.shape[1]}")

# Check for missing values
print("\n=== Missing value check ===")
missing_values = features_filtered.isnull().sum()
if missing_values.sum() == 0:
    print("✓ No missing values")
else:
    print("Missing value counts:")
    for col, missing_count in missing_values[missing_values > 0].items():
        print(f"  {col}: {missing_count} ({missing_count/len(features_filtered)*100:.1f}%)")

# Check data types
print("\n=== Data type check ===")
print("Data type distribution:")
print(features_filtered.dtypes.value_counts())

# Show basic summary statistics
print("\n=== Basic statistics ===")
print(features_filtered.describe())


=== Data quality check ===
Dataset shape: (3205, 22)
Rows (counties): 3205
Columns (features): 22

=== Missing value check ===
Missing value counts:
  Flu Vaccinations raw value: 27 (0.8%)
  Access to Exercise Opportunities raw value: 55 (1.7%)
  Food Environment Index raw value: 52 (1.6%)
  Primary Care Physicians raw value: 166 (5.2%)
  Mental Health Providers raw value: 179 (5.6%)
  Dentists raw value: 96 (3.0%)
  Preventable Hospital Stays raw value: 83 (2.6%)
  Mammography Screening raw value: 31 (1.0%)
  Uninsured raw value: 9 (0.3%)
  Air Pollution: Particulate Matter raw value: 88 (2.7%)
  Drinking Water Violations raw value: 63 (2.0%)
  Library Access raw value: 137 (4.3%)
  Severe Housing Problems raw value: 8 (0.2%)
  Long Commute - Driving Alone raw value: 8 (0.2%)
  Some College raw value: 9 (0.3%)
  High School Completion raw value: 8 (0.2%)
  Unemployment raw value: 10 (0.3%)
  Income Inequality raw value: 30 (0.9%)
  Children in Poverty raw value: 9 (0.3%)
  Social Associations raw value: 8 (0.2%)
  Child Care Cost Burden raw value: 11 (
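The check above reports gaps of up to about 5.6% per column, and the notebook does not say how they are handled before clustering. One possible approach, sketched here as an assumption rather than as the method used in this assignment, is to coerce every column to numeric and fill each gap with the column median. The `toy` frame and its values are hypothetical stand-ins for `features_filtered`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of features_filtered with a few gaps.
toy = pd.DataFrame({
    "Uninsured raw value": [0.08, np.nan, 0.12, 0.10],
    "Unemployment raw value": [0.03, 0.05, np.nan, 0.04],
})

# Coerce to numeric first: a stray text value would otherwise leave the
# column as object dtype and break median() and describe().
toy = toy.apply(pd.to_numeric, errors="coerce")

# Fill each column's gaps with that column's median (assumed strategy).
imputed = toy.fillna(toy.median())

print(imputed.isnull().sum().sum())  # total remaining missing values
```

The median is less sensitive to extreme counties than the mean, which is why it is used in this sketch; dropping incomplete rows would be an alternative if losing a few counties is acceptable.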

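### Feature Selection Summary

In Part 2, I completed the following feature selection work:

1. **Feature selection strategy**:
   - Keep only the features in the Community Conditions group, which describe community conditions and can be treated as factors that influence health outcomes
   - Exclude every feature in the Population Health and Well-Being group, because these are themselves health outcome measures

2. **Community Conditions group features**:
   - Health care resources (primary care physicians, dentists, mental health providers, etc.)
   - Environmental conditions (air pollution, drinking water violations, etc.)
   - Socioeconomic factors (unemployment, income inequality, children in poverty, etc.)
   - Education and social factors (some college, high school completion, social associations, etc.)

3. **Further removal of irrelevant features**:
   - Remove `Broadband Access raw value` from the Community Conditions group: broadband access has little to do with health
   - Remove `Driving Alone to Work raw value` from the Community Conditions group: driving alone to work is hard to control through policy

4. **Final feature set**:
   - 22 relevant features selected from the 24 features in the Community Conditions group
   - These features span several dimensions, including health care, environment, and socioeconomic conditions
   - They provide suitable input features for the clustering and classification analysis that follows

This feature selection process ensures that:
- Data leakage is avoided (the target variables are not included)
- The selected features can be influenced by policy
- The features remain diverse and representative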
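The leakage claim above can be verified mechanically. The sketch below is a minimal check, assuming the five outcome columns and the two dropped columns listed in Part 2; `selected` is a shortened, illustrative stand-in for the final 22-column list, and all variable names here are hypothetical.

```python
# Outcome columns excluded in Part 2 (copied from the notebook).
outcome_cols = [
    "Premature Death raw value",
    "Poor Physical Health Days raw value",
    "Low Birth Weight raw value",
    "Poor Mental Health Days raw value",
    "Poor or Fair Health raw value",
]

# Columns removed by judgment in Part 2.
dropped_cols = [
    "Broadband Access raw value",
    "Driving Alone to Work raw value",
]

# Shortened stand-in for the final feature list (illustrative only).
selected = [
    "Flu Vaccinations raw value",
    "Uninsured raw value",
    "Injury Deaths raw value",
]

# No outcome column and no dropped column may appear among the features.
leaked = set(selected) & set(outcome_cols)
not_removed = set(selected) & set(dropped_cols)

assert not leaked, f"Outcome columns used as features: {sorted(leaked)}"
assert not not_removed, f"Dropped columns still present: {sorted(not_removed)}"
print("Feature list contains no outcome or dropped columns")
```

In the notebook itself, the same check could be run with `list(features_filtered.columns)` in place of `selected`, which would also give the otherwise unused `population_health_cols_to_remove` list a purpose.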


## Part 3: Exploratory Data Analysis (EDA)

## Part 4: Clustering

## Part 5: Two Supervised Learning Models

### First model:

### Second model:

## Part 6: Recommendations