**@author: Daniel Ramirez Guitron**

Date: 19/04/2025

LinkedIn: https://www.linkedin.com/in/danielguitron/

Github: https://github.com/dannngu

E-mail: contactguitron@gmail.com

# ⚕️ Project: Schizophrenia Detection - Initial Data Preprocessing
---

### Problem

The goal is to **train and validate a Random Forest classifier** that predicts whether a person may have schizophrenia from multiple psychosocial variables.


### Dataset
The dataset **schizophrenia.csv** contains a total of **5,610** records.

For each record (or person) we have the following information:

- `age`: the person's age
- `gender`: female (0) or male (1)
- `education`: primary (0), secondary (1), middle or high school (2), university (3), postgraduate (4)
- `marital_status`: single (0), married (1), divorced (2), widowed (3)
- `occupation`: unemployed (0), employed (1), retired (2), student (3)
- `income_level`: low income (0), middle income (1), high income (2)
- `housing`: rural area (0), urban area (1)
- `family_history`: no relatives with schizophrenia (0), has had relatives with schizophrenia (1)
- `substance_use`: does not use tobacco, alcohol, or other substances (0), does use (1)
- `suicide_attempt`: no (0), yes (1)
- `enviroment_risk`: social environment risk, low (0), medium (1), high (2)
- `stressors`: stress factors, low (0), medium (1), high (2)
- `medication_adherence`: low (0), moderate (1), good (2)

🎯 **Target Variable**
- `diagnosis`: does not have schizophrenia (0), has schizophrenia (1)
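
When inspecting records by eye it can help to turn the integer codes back into their labels. The following is a minimal sketch, assuming the codings listed above; the `CODEBOOK` dictionary, the `decode` helper, and the toy rows are illustrative names and data, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings copied from the variable list above (subset only).
CODEBOOK = {
    "gender": {0: "female", 1: "male"},
    "marital_status": {0: "single", 1: "married", 2: "divorced", 3: "widowed"},
    "occupation": {0: "unemployed", 1: "employed", 2: "retired", 3: "student"},
    "housing": {0: "rural area", 1: "urban area"},
    "diagnosis": {0: "no schizophrenia", 1: "schizophrenia"},
}


def decode(frame: pd.DataFrame, codebook: dict = CODEBOOK) -> pd.DataFrame:
    """Return a copy of `frame` with coded columns replaced by text labels.

    Codes that are missing from the codebook become NaN, which makes
    undocumented values easy to spot.
    """
    out = frame.copy()
    for column, mapping in codebook.items():
        if column in out.columns:
            out[column] = out[column].map(mapping)
    return out


# Hand-made toy rows (hypothetical, NOT records from schizophrenia.csv).
toy = pd.DataFrame(
    {"age": [30, 61], "gender": [0, 1], "marital_status": [1, 3], "diagnosis": [0, 1]}
)
decoded = decode(toy)
print(decoded)
```

The decoded copy is meant for inspection only; the model would still be trained on the integer-coded columns.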


In [5]:
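import pandas as pd

# Load the raw dataset (path is relative to the notebook's location)
df = pd.read_csv('../data/raw/schizophrenia.csv')

# Preview five randomly chosen records
df.sample(5)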

index,age,gender,education,marital_status,occupation,income_level,housing,family_history,substance_use,suicide_attempt,enviroment_risk,stressors,medication_adherence,diagnosis
2046,72,1,4,3,0,2,1,1,0,0,2,1,2,0
2148,33,1,2,2,1,0,0,0,0,0,1,1,2,0
3516,36,0,5,1,3,1,1,0,0,0,2,2,2,0
2517,54,1,5,2,2,1,0,0,0,0,0,2,1,0
446,66,1,5,1,1,0,0,1,0,0,2,0,1,1


In [6]:
df.shape

(5610, 14)

Let's explore the data types of each variable (column) as well as whether or not there is missing data:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5610 entries, 0 to 5609
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   age                   5610 non-null   int64
 1   gender                5610 non-null   int64
 2   education             5610 non-null   int64
 3   marital_status        5610 non-null   int64
 4   occupation            5610 non-null   int64
 5   income_level          5610 non-null   int64
 6   housing               5610 non-null   int64
 7   family_history        5610 non-null   int64
 8   substance_use         5610 non-null   int64
 9   suicide_attempt       5610 non-null   int64
 10  enviroment_risk       5610 non-null   int64
 11  stressors             5610 non-null   int64
 12  medication_adherence  5610 non-null   int64
 13  diagnosis             5610 non-null   int64
dtypes: int64(14)
memory usage: 613.7 KB
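
The `df.info()` output above reports 5,610 non-null values in every column, so nothing is missing. The same check can be made explicit with `isna()`. This is a small sketch; `missing_report` is a hypothetical helper name and the toy frame is made up so that the check has something to find.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, for affected columns only."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame with one deliberately missing age (hypothetical data).
toy = pd.DataFrame({"age": [25, None, 40], "gender": [0, 1, 1]})
report = missing_report(toy)
print(report)
```

An empty result from `missing_report(df)` would confirm what `df.info()` already suggests for this dataset.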


Let's use the pandas `describe()` method to determine the range of values for each variable:

In [8]:
df.describe()

statistic,age,gender,education,marital_status,occupation,income_level,housing,family_history,substance_use,suicide_attempt,enviroment_risk,stressors,medication_adherence,diagnosis
count,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0,5610.0
mean,49.107308,0.501604,3.048663,1.517825,1.503922,0.990731,0.493939,0.410517,0.27041,0.157576,1.001783,1.004813,0.997504,0.514617
std,18.223271,0.500042,1.409397,1.111915,1.114613,0.812358,0.500008,0.491971,0.444211,0.364375,0.812628,0.82015,0.836198,0.499831
min,18.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,33.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,49.0,1.0,3.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
75%,65.0,1.0,4.0,3.0,3.0,2.0,1.0,1.0,1.0,0.0,2.0,2.0,2.0,1.0
max,80.0,1.0,5.0,3.0,3.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0


**Observations**
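- All variables fall within plausible ranges (for example, `age` spans 18 to 80), and no outliers were found.
- One mismatch with the description above: `education` ranges from 1 to 5 in the data, while its documented coding runs from 0 to 4. This suggests the documented coding is offset by one rather than that the data contain errors.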
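
The range check can also be automated instead of read off the `describe()` table. The sketch below compares each coded column with the bounds given in the variable list; `EXPECTED_RANGES` and `out_of_range` are illustrative names, and the toy frame is made up. The bounds for `education` are copied from the documented coding (0 to 4), so on the real data, where `describe()` shows a maximum of 5, this check would flag the values equal to 5.

```python
import pandas as pd

# Inclusive (min, max) bounds taken from the documented codings (subset only).
EXPECTED_RANGES = {
    "gender": (0, 1),
    "education": (0, 4),
    "marital_status": (0, 3),
    "occupation": (0, 3),
    "income_level": (0, 2),
    "housing": (0, 1),
    "diagnosis": (0, 1),
}


def out_of_range(frame: pd.DataFrame, expected: dict = EXPECTED_RANGES) -> dict:
    """Count, per column, the values that fall outside the expected bounds.

    Missing values also count as violations, because `between` returns
    False for NaN.
    """
    violations = {}
    for column, (low, high) in expected.items():
        if column not in frame.columns:
            continue
        bad = ~frame[column].between(low, high)
        if bad.any():
            violations[column] = int(bad.sum())
    return violations


# Toy frame (hypothetical data): one education value exceeds the documented maximum.
toy = pd.DataFrame({"gender": [0, 1, 1], "education": [1, 5, 3]})
violations = out_of_range(toy)
print(violations)
```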



Now we check whether the classes of the target variable are balanced.

In [9]:
# Verify categorical distribution (target)
df['diagnosis'].value_counts()

diagnosis
1    2887
0    2723
Name: count, dtype: int64

**Observations**

- The target variable is roughly balanced: 2,887 records with a diagnosis (about 51.5%) versus 2,723 without (about 48.5%).
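
Class shares are often easier to judge than raw counts. As a minimal sketch, the target column is rebuilt here from the two counts reported above (rather than read from the CSV) and summarized as counts and proportions; `class_balance` is a hypothetical helper name.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return the count and the share of each class in `target`."""
    counts = target.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Stand-in for df['diagnosis'], rebuilt from the counts in the output above.
target = pd.Series([1] * 2887 + [0] * 2723, name="diagnosis")
balance = class_balance(target)
print(balance.round(3))
```

On the real frame the same table would come from `class_balance(df['diagnosis'])`.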

## Final Conclusions
--- 

In this case the dataset is clean and needs no changes: there are no missing values, no outliers, and no values that fall outside a sensible range. Because we are going to use a tree-based classifier (**Random Forest**), it is also unnecessary to scale or otherwise transform the variables, since trees split on thresholds and depend only on the ordering of each feature's values.
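
The claim that a tree-based model does not need feature scaling can be checked empirically. The sketch below is an illustration on synthetic data, not the project's actual training code: it fits two Random Forests with the same seed, one on raw features and one on standardized features, and measures how often their predictions agree. The data generator, the hyperparameters, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the real dataset): an age-like column and
# two integer-coded columns, with a label loosely tied to the coded ones.
n = 400
X = np.column_stack([
    rng.integers(18, 81, size=n),  # age-like values, 18..80
    rng.integers(0, 2, size=n),    # binary code, 0..1
    rng.integers(0, 3, size=n),    # three-level code, 0..2
]).astype(float)
y = ((X[:, 1] + X[:, 2] + rng.normal(0, 0.5, size=n)) > 1.5).astype(int)

# Stratified hold-out split so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training portion only, then fit one forest per version.
scaler = StandardScaler().fit(X_train)
raw_model = RandomForestClassifier(n_estimators=50, random_state=0)
raw_model.fit(X_train, y_train)
scaled_model = RandomForestClassifier(n_estimators=50, random_state=0)
scaled_model.fit(scaler.transform(X_train), y_train)

raw_pred = raw_model.predict(X_test)
scaled_pred = scaled_model.predict(scaler.transform(X_test))

# Share of test rows on which both forests give the same prediction.
agreement = float(np.mean(raw_pred == scaled_pred))
print(f"Share of identical predictions: {agreement:.3f}")
```

Standardization is a monotonic transformation, so the agreement is expected to be at or very close to 1.0. A distance-based or gradient-based model would not generally show this invariance.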