<a href="https://colab.research.google.com/github/aiseongjun/Hands-On-Machine-Learning/blob/main/toyproject_idea.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. GPA Data by Lifestyle**

## **Dataset**

https://www.kaggle.com/datasets/steve1215rogg/student-lifestyle-dataset/data

In [None]:
import pandas as pd

gpa = pd.read_csv('/content/student_lifestyle_dataset.csv')
gpa.head()

Unnamed: 0,Student_ID,Study_Hours_Per_Day,Extracurricular_Hours_Per_Day,Sleep_Hours_Per_Day,Social_Hours_Per_Day,Physical_Activity_Hours_Per_Day,GPA,Stress_Level
0,1,6.9,3.8,8.7,2.8,1.8,2.99,Moderate
1,2,5.3,3.5,8.0,4.2,3.0,2.75,Low
2,3,5.1,3.9,9.2,1.2,4.6,2.67,Low
3,4,6.5,2.1,7.2,1.7,6.5,2.88,Moderate
4,5,8.1,0.6,6.5,2.2,6.6,3.51,High


In [None]:
gpa.columns

Index(['Student_ID', 'Study_Hours_Per_Day', 'Extracurricular_Hours_Per_Day',
       'Sleep_Hours_Per_Day', 'Social_Hours_Per_Day',
       'Physical_Activity_Hours_Per_Day', 'GPA', 'Stress_Level'],
      dtype='object')

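* 'Student_ID': unique student identifier
* 'Study_Hours_Per_Day': hours spent studying per day
* 'Extracurricular_Hours_Per_Day': hours spent on extracurricular activities per day
* 'Sleep_Hours_Per_Day': hours of sleep per day
* 'Social_Hours_Per_Day': hours spent socializing per day
* 'Physical_Activity_Hours_Per_Day': hours of physical activity per day
* 'Stress_Level': stress level (Low, Moderate, or High)
* 'GPA': grade point average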
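
Because `Stress_Level` appears in the preview as Low, Moderate, or High, one option is to encode it as a single ordered category instead of separate dummy columns. The snippet below is a minimal sketch of that idea. The ordering Low < Moderate < High and the names `order` and `stress_code` are assumptions made for illustration; the five values are copied from the `gpa.head()` preview.

```python
import pandas as pd

# Stress_Level values of the five rows shown in the gpa.head() preview
stress = pd.Series(['Moderate', 'Low', 'Low', 'Moderate', 'High'], name='Stress_Level')

# Assumed ordering of the levels (hypothetical, not stated by the dataset)
order = ['Low', 'Moderate', 'High']

# Map each label to its position in `order`: Low -> 0, Moderate -> 1, High -> 2
stress_code = pd.Categorical(stress, categories=order, ordered=True).codes
print(list(stress_code))
```

A tree-based model can split on this single integer column, whereas one-hot encoding spreads the same information over three columns.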

## **Project Goal**

```
The goal of this project is to analyze how various lifestyle factors affect college students' GPA,
and to use those findings to build a model that predicts GPA.
```
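
A quick first look at how each lifestyle factor relates to GPA is a correlation table. The sketch below uses only the five rows visible in the `gpa.head()` preview, so it shows the mechanics rather than a reliable estimate; the names `preview` and `corr_with_gpa` are hypothetical, and on the full data the same call could be applied to `gpa` directly.

```python
import pandas as pd

# The five rows from the gpa.head() preview (a subset of the columns)
preview = pd.DataFrame({
    'Study_Hours_Per_Day': [6.9, 5.3, 5.1, 6.5, 8.1],
    'Sleep_Hours_Per_Day': [8.7, 8.0, 9.2, 7.2, 6.5],
    'GPA': [2.99, 2.75, 2.67, 2.88, 3.51],
})

# Pearson correlation of each numeric feature with GPA.
# select_dtypes skips string columns such as Stress_Level on the full dataset.
corr_with_gpa = (
    preview.select_dtypes('number')
    .corr()['GPA']
    .drop('GPA')
    .sort_values(ascending=False)
)
print(corr_with_gpa)
```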

## **Simple Model**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# One-hot encode the categorical Stress_Level column
gpa = pd.get_dummies(gpa, columns=['Stress_Level'], prefix='Stress_Level')
# Features: everything except the identifier and the target
X = gpa.drop(['Student_ID', 'GPA'], axis=1)
y = gpa['GPA']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Impurity-based importances, labeled with the feature names
feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
print('Feature importances')
print(feature_importances)

Feature importances
Study_Hours_Per_Day                0.597014
Extracurricular_Hours_Per_Day      0.098186
Sleep_Hours_Per_Day                0.094497
Social_Hours_Per_Day               0.103790
Physical_Activity_Hours_Per_Day    0.100714
Stress_Level_High                  0.002252
Stress_Level_Low                   0.001111
Stress_Level_Moderate              0.002437
dtype: float64
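
The values above come from `feature_importances_`, which scikit-learn derives from impurity reduction on the training data. A common cross-check is permutation importance measured on held-out data. The following is a minimal sketch on synthetic data, not on the GPA dataset: the column names `study` and `sleep`, the generating formula, and all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on `study` only; `sleep` is pure noise
rng = np.random.default_rng(0)
n = 300
X_toy = pd.DataFrame({
    'study': rng.uniform(5, 10, n),
    'sleep': rng.uniform(5, 10, n),
})
y_toy = 2.0 + 0.15 * X_toy['study'] + rng.normal(0, 0.05, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and record the drop in R^2
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
perm = pd.Series(result.importances_mean, index=X_toy.columns)
print(perm)
```

In the notebook itself the same call would take `rf`, `X_test`, and `y_test` in place of the synthetic objects.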


In [None]:
rf.score(X_test, y_test)

0.4412449841822442
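
For a regressor, `score` returns the coefficient of determination R², so the value above means the forest explains roughly 44% of the GPA variance in the test split. Since a single split can be noisy, one way to put this number in context is to cross-validate and compare against a simple linear baseline. The sketch below does this on synthetic data; the data-generating formula and the names `models` and `cv_r2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target is linear in the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(5, 10, size=(200, 3))
y_toy = 2.0 + 0.15 * X_toy[:, 0] + rng.normal(0, 0.1, 200)

models = {
    'linear': LinearRegression(),
    'forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean R^2 over 5 folds for each model
cv_r2 = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring='r2').mean()
    for name, model in models.items()
}
print(cv_r2)
```

On the real data, `X` and `y` from the cell above could be passed in place of `X_toy` and `y_toy`.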

# **2. Depression Data**

## **Dataset**

https://www.kaggle.com/datasets/anthonytherrien/depression-dataset

In [None]:
import pandas as pd

depression = pd.read_csv('/content/depression_data.csv')
depression.head()

Unnamed: 0,Name,Age,Marital Status,Education Level,Number of Children,Smoking Status,Physical Activity Level,Employment Status,Income,Alcohol Consumption,Dietary Habits,Sleep Patterns,History of Mental Illness,History of Substance Abuse,Family History of Depression,Chronic Medical Conditions
0,Christine Barker,31,Married,Bachelor's Degree,2,Non-smoker,Active,Unemployed,26265.67,Moderate,Moderate,Fair,Yes,No,Yes,Yes
1,Jacqueline Lewis,55,Married,High School,1,Non-smoker,Sedentary,Employed,42710.36,High,Unhealthy,Fair,Yes,No,No,Yes
2,Shannon Church,78,Widowed,Master's Degree,1,Non-smoker,Sedentary,Employed,125332.79,Low,Unhealthy,Good,No,No,Yes,No
3,Charles Jordan,58,Divorced,Master's Degree,3,Non-smoker,Moderate,Unemployed,9992.78,Moderate,Moderate,Poor,No,No,No,No
4,Michael Rich,18,Single,High School,0,Non-smoker,Sedentary,Unemployed,8595.08,Low,Moderate,Fair,Yes,No,Yes,Yes


In [None]:
depression.columns

Index(['Name', 'Age', 'Marital Status', 'Education Level',
       'Number of Children', 'Smoking Status', 'Physical Activity Level',
       'Employment Status', 'Income', 'Alcohol Consumption', 'Dietary Habits',
       'Sleep Patterns', 'History of Mental Illness',
       'History of Substance Abuse', 'Family History of Depression',
       'Chronic Medical Conditions'],
      dtype='object')

* 'Name': name
* 'Age': age
* 'Marital Status': marital status
* 'Education Level': education level
* 'Number of Children': number of children
* 'Smoking Status': smoking status
* 'Physical Activity Level': physical activity level
* 'Employment Status': employment status
* 'Income': income
* 'Alcohol Consumption': alcohol consumption level
* 'Dietary Habits': dietary habits
* 'Sleep Patterns': sleep patterns
* 'History of Mental Illness': history of mental illness
* 'History of Substance Abuse': history of substance abuse
* 'Family History of Depression': family history of depression
* 'Chronic Medical Conditions': chronic medical conditions

## **Project Goal**

```
The goal of this project is to analyze how each feature contributes to depression,
and to use those findings to build a model that identifies people vulnerable to depression early.
```

## **Simple Model**

In [None]:
depression.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413768 entries, 0 to 413767
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Name                          413768 non-null  object 
 1   Age                           413768 non-null  int64  
 2   Marital Status                413768 non-null  object 
 3   Education Level               413768 non-null  object 
 4   Number of Children            413768 non-null  int64  
 5   Smoking Status                413768 non-null  object 
 6   Physical Activity Level       413768 non-null  object 
 7   Employment Status             413768 non-null  object 
 8   Income                        413768 non-null  float64
 9   Alcohol Consumption           413768 non-null  object 
 10  Dietary Habits                413768 non-null  object 
 11  Sleep Patterns                413768 non-null  object 
 12  History of Mental Illness     413768 non-nul

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Categorical feature columns to one-hot encode; the prefix defaults to each column name
categorical_cols = ['Marital Status', 'Education Level', 'Smoking Status', 'Physical Activity Level',
                    'Employment Status', 'Alcohol Consumption', 'Dietary Habits', 'Sleep Patterns',
                    'History of Mental Illness', 'History of Substance Abuse',
                    'Family History of Depression']
depression = pd.get_dummies(depression, columns=categorical_cols)
# Target: 'Chronic Medical Conditions' (the columns above include no explicit depression label)
X = depression.drop(['Name', 'Chronic Medical Conditions'], axis=1)
y = depression['Chronic Medical Conditions']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Impurity-based importances, labeled with the feature names
feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
print('Feature importances')
print(feature_importances)

Feature importances
Age                                  0.283612
Number of Children                   0.068137
Income                               0.384143
Marital Status_Divorced              0.006094
Marital Status_Married               0.009556
Marital Status_Single                0.003692
Marital Status_Widowed               0.006379
Education Level_Associate Degree     0.009364
Education Level_Bachelor's Degree    0.010722
Education Level_High School          0.009304
Education Level_Master's Degree      0.009414
Education Level_PhD                  0.004751
Smoking Status_Current               0.003577
Smoking Status_Former                0.007349
Smoking Status_Non-smoker            0.007391
Physical Activity Level_Active       0.006116
Physical Activity Level_Moderate     0.007637
Physical Activity Level_Sedentary    0.006242
Employment Status_Employed           0.002261
Employment Status_Unemployed         0.002224
Alcohol Consumption_High             0.009651
Alcohol Consumption_Low    

In [None]:
rf.score(X_test, y_test)

0.6153660246030404
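
For a classifier, `score` returns mean accuracy, so the value above is easier to interpret next to the accuracy of always predicting the most frequent class. The sketch below shows that baseline on synthetic labels. The 67/33 class ratio is arbitrary and was not measured from the depression dataset, and the names `baseline` and `baseline_acc` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic features and labels with an arbitrary class imbalance
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.where(rng.random(1000) < 0.67, 'No', 'Yes')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print(baseline_acc)
```

In the notebook, fitting the same `DummyClassifier` on `X_train`, `y_train` and scoring it on `X_test`, `y_test` would give the number to compare against the forest's accuracy.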

# **3. Premier League Champions**

https://www.kaggle.com/datasets/panaaaaa/english-premier-league-and-championship-full-dataset?select=England+CSV.csv