Let's build a better model to predict autism in adults (aged 18+) and analyze its performance!

"Class/ASD" is the questionnaire's result: "yes" when the respondent scores 7 or more across the A1 - A10 columns (values are binary; that is, each is scored as 1 or 0). The "austim" column (misspelled in the source dataset, later renamed to "autism") is the assumed true value, where individuals self-disclose whether they have already been diagnosed with autism.

The actual questions for A1 - A10 are shown below, after the dataset is displayed.

Key to the questionnaire and its source: https://wchh.onlinelibrary.wiley.com/doi/10.1002/psb.1816

The dataset and more information: 
https://www.kaggle.com/datasets/andrewmvd/autism-screening-on-adults



<b>Part 1. Exploratory data analysis</b>

Before doing anything, let's import some typical libraries and take a look at the dataset, the column names, datatypes of each column, and see if there are any missing values or outliers we should take care of first.

In [None]:
# Let's import libraries to get started
import numpy as np
import pandas as pd

In [None]:
# Let's load the dataset and take a look

# I'll keep a copy of the raw data first
df_raw = pd.read_csv("Autism_Dataraw.csv")

# Let's work with 'df' (an independent copy, so df_raw stays untouched)
df = df_raw.copy()
df.head()

For reference on the A1 - A10 scores, according to the questionnaire:


"SCORING: Only 1 point can be scored for each question. Score 1 point for Definitely or Slightly agree on each of items 1, 7, 8, and 10. Score 1 point for Definitely or Slightly Disagree on each of items 2, 3, 4, 5, 6, and 9. If the individual scores more than 6 out of 10, consider referring them for a specialist diagnostic assessment."
 
"Please tick one option per question only:
Definitely agree
Slightly agree
Slightly disagree
Definitely disagree"

1
I often notice small sounds when others do not.
 
2
I usually concentrate more on the whole picture, rather than the small details.
   
3
I find it easy to do more than one thing at once
  
4
If there is an interruption, I can switch back to what I was doing very quickly
  
5
I find it easy to ‘read between the lines’ when someone is talking to me
 
6
I know how to tell if someone listening to me is getting bored
 
7
When I’m reading a story I find it difficult to work out the characters’ intentions
 
8
I like to collect information about categories of things (e.g. types of car, bird, train, plant etc.)
 
9
I find it easy to work out what someone is thinking or feeling just by looking at their face
  
10
I find it difficult to work out people’s intentions
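
To make the scoring key concrete, here is a minimal sketch of how raw answers could be turned into the 0/1 values stored in A1 - A10. The function names and the example respondent are hypothetical; only the item numbers and the "more than 6" cutoff come from the key quoted above.

```python
# Items where agreeing scores a point, per the scoring key quoted above;
# on every other item, disagreeing scores the point.
AGREE_ITEMS = {1, 7, 8, 10}

AGREE = {"Definitely agree", "Slightly agree"}
DISAGREE = {"Slightly disagree", "Definitely disagree"}


def score_item(item_number, response):
    """Return 1 if the response earns a point for this item, else 0."""
    if item_number in AGREE_ITEMS:
        return int(response in AGREE)
    return int(response in DISAGREE)


def score_questionnaire(responses):
    """responses: dict mapping item number (1-10) to the ticked option."""
    return sum(score_item(item, answer) for item, answer in responses.items())


# Hypothetical respondent who ticks "Slightly agree" on every item
example = {item: "Slightly agree" for item in range(1, 11)}
total = score_questionnaire(example)
refer = total > 6  # "more than 6 out of 10" in the key
print(total, refer)
```

Only items 1, 7, 8, and 10 score a point for this respondent, so the total stays below the referral cutoff.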

In [None]:
# Let's look at all the columns to see 
# what we're working with
df.columns

In [None]:
# Let's correct some column misspellings
df = df.rename(
    {'austim': 'autism',
     'jundice': 'jaundice',
     'contry_of_res': 'country_of_res'},
    axis = 'columns'
)

# Check to see if all columns are now spelled correctly
df.columns

In [None]:
# And the datatypes of each?
df.dtypes

In [None]:
# The "object" type columns may indicate some issues
# regarding columns that presumably should be int64,
# like the age column. So let's clean the data.
# To start, are there any missing values?
print('Total missing values:', df.isnull().sum().sum())

In [None]:
# Great, but I noticed there are many values
# with a '?'. What columns contain a '?'?
df.columns[df.isin(['?']).any()]

In [None]:
# Let's take care of these, starting with age.

# Briefly describe the age column
display(df['age'].describe())

# What type of values are in the age column?
display(df['age'].apply(type).unique())

In [None]:
# On first glance it can be seen above that 21 is the
# most common age group that took this questionnaire, 
# among 47 different ages. The str type probably is 
# due to the presence of the '?'. Let's show the 
# unique age values.
df['age'].value_counts()

In [None]:
# There's age 383 once and age '?' twice.
# Considering these are a total of 3 values,
# I consider this relatively insignificant.
# Let's replace these with the mode age.

# Determine the mode(age)
age_mode = int(df['age'].mode()[0])
print('Mode of age column:', age_mode)

In [None]:
# Replace the '?' with the mode.
df['age']= df['age'].replace({'?': age_mode})

In [None]:
# Now we can change the datatype of the 
# age column values to int, so that we can
# then replace the 383 value with the mode.
df['age'] = (df['age'].values.astype(int))
df['age']= df['age'].replace({383: age_mode})
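
As an aside, the same cleanup can be sketched in a single pass with `pd.to_numeric`. This is an alternative on hypothetical toy data, not the approach used in the cells above, and the 100-year cutoff for "implausible" ages is my own assumption.

```python
import pandas as pd

# Hypothetical toy column mimicking the raw 'age' values
ages = pd.Series(['21', '?', '35', '383', '21'])

# Non-numeric entries such as '?' become NaN
ages_num = pd.to_numeric(ages, errors='coerce')

# Treat implausible ages as missing too (the 100-year cutoff is an assumption)
ages_num = ages_num.where(ages_num <= 100)

# Fill every missing value with the mode of the remaining valid ages
toy_mode = ages_num.mode()[0]
ages_clean = ages_num.fillna(toy_mode).astype(int)
print(ages_clean.tolist())
```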

In [None]:
# Check to see if age column is cleansed
df['age'].value_counts()

In [None]:
# Now let's investigate ethnicity.
# What are its value counts?
df['ethnicity'].value_counts()

In [None]:
# There are many issues here. First we can 
# combine "others" with "Others"
df['ethnicity'] = df['ethnicity'].replace('others', 'Others')

# Check
df['ethnicity'].value_counts()

In [None]:
# Let's remove the quotations ''
df['ethnicity'] = df['ethnicity'].str.strip("''")

# Check
df['ethnicity'].value_counts()

In [None]:
# We could replace '?' with the mode (White-European),
# but this may come across as a form of bias.
# To get some more insight, let's look at 
# country of residence.
df['country_of_res'].value_counts()

In [None]:
# Let's get rid of the quotations
df['country_of_res'] = df['country_of_res'].str.strip("''")
df['country_of_res'].value_counts()

In [None]:
# Check that the country of residence column has no '?' values
df[df['country_of_res'] == '?']['country_of_res']

In [None]:
# Let's take a closer look at how many "?" per country
df[['country_of_res', 'ethnicity']].sort_values(['country_of_res', 'ethnicity'], ascending = True).head(20)

In [None]:
# It can be seen that some countries don't have any known ethnicities.
# So let's replace those that have '?' with the most common
# ethnicity in that country. If a country has all '?', replacing with
# 'Others' seems like a reasonable assumption.
# transform() keeps the original row index, so the result lines up with df.
df['ethnicity'] = df.groupby('country_of_res')['ethnicity'].transform(
    lambda x: x.replace('?', 'Others') if (x == '?').all() else x.replace('?', x[x != '?'].mode()[0])
)

# Check there are no more '?'
df['ethnicity'].value_counts()
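
To see what the per-country fill does, here is a small sketch on made-up rows. The toy countries and the `fill_unknown` helper are hypothetical; the rule itself (most common known ethnicity in the country, else 'Others') is the one described above.

```python
import pandas as pd

# Hypothetical toy rows: one country with known ethnicities, one with none
toy = pd.DataFrame({
    'country_of_res': ['Jordan', 'Jordan', 'Jordan', 'Iceland', 'Iceland'],
    'ethnicity': ['Middle Eastern', 'Middle Eastern', '?', '?', '?'],
})


def fill_unknown(group):
    """Replace '?' with the group's most common known value, else 'Others'."""
    known = group[group != '?']
    if known.empty:
        return group.replace('?', 'Others')
    return group.replace('?', known.mode()[0])


# transform() returns a Series aligned with the original rows
toy['ethnicity'] = toy.groupby('country_of_res')['ethnicity'].transform(fill_unknown)
print(toy)
```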



In [None]:
# And let's see these two columns again
df[['country_of_res', 'ethnicity']].sort_values(['country_of_res', 'ethnicity'], ascending = True).head(20)

In [None]:
# Looking good. Now, the last column with '?' is relation.
# Let's look
df['relation'].value_counts()

In [None]:
# As usual let's strip any quotations.
df['relation'] = df['relation'].str.strip("''")
df['relation'].value_counts()

In [None]:
# Relation is who is taking the test. There don't seem to be
# any other columns that give us clues as to what to fill these with.
# So we'll go ahead and replace with 'Unknown' for clarity.
df['relation'] = df['relation'].replace('?', 'Unknown')
df['relation'].value_counts()


In [None]:
# Let's check to see there are no more '?'
# (this should display an empty DataFrame)
df[df.isin(['?']).any(axis=1)]

In [None]:
# Let's take a final bird's eye view to check value counts
# for each.
for column in df.columns:
    print(df[column].value_counts())
    
    # for spacing
    print("")

In [None]:
# age_desc is constant. We can drop this.

# Should be True if all values in age_desc are the same
display(df['age_desc'].nunique() == 1)
df = df.drop('age_desc', axis="columns")

In [None]:
# Lastly, for consistency, let's lowercase the Class/ASD column.
df['Class/ASD']=df['Class/ASD'].str.lower()

In [None]:
# Let's revisit our columns and ensure everything is clean.
display(df.columns)
display(df.head())

<b>Part 2. How accurate is the questionnaire?</b>

Before implementing a better model, let's analyze how well the questionnaire alone did at predicting autism.

In [None]:
# First let's calculate how many times the questionnaire was off.

# Show first few rows of columns we are comparing
display(df[['autism', 'Class/ASD']].head())

# Number of correct classifications from the questionnaire
num_correct_from_q = np.count_nonzero(
    df['autism'] == df['Class/ASD']
)

# Show the results
print("Number of correct classifications:", num_correct_from_q, "out of", df.shape[0])
print("Percentage correct:", np.round(num_correct_from_q/df.shape[0] * 100))

In [None]:
# Let's visualize the error.

# Import some more libraries.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, cohen_kappa_score

In [None]:
# Create confusion matrix
conf_matrix_original = confusion_matrix(df['autism'], df['Class/ASD'], labels=['yes', 'no'])

# Visualization using Seaborn heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_original, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted Yes', 'Predicted No'], 
            yticklabels=['True Yes', 'True No'])
plt.title('Confusion Matrix between Actual and Predicted Autism from Questionnaire Alone')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


In [None]:
# Is the questionnaire better than random?

# Compute Cohen's kappa (0 = chance-level agreement, 1 = perfect agreement)
questionnaire_kappa_stat = cohen_kappa_score(df['autism'],df['Class/ASD'])
questionnaire_kappa_stat

The questionnaire is slightly better than random.
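
For intuition on what the kappa statistic measures, here is a small worked sketch of the standard formula, kappa = (p_observed - p_expected) / (1 - p_expected), computed by hand. The confusion matrix below is made up for illustration and is not the questionnaire's actual result.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix (rows: actual, columns: predicted)
cm = np.array([[20, 5],
               [10, 15]])

n = cm.sum()

# Observed agreement: fraction of cases on the diagonal
p_observed = np.trace(cm) / n

# Agreement expected by chance, from the row and column marginals
row_marginals = cm.sum(axis=1) / n
col_marginals = cm.sum(axis=0) / n
p_expected = np.sum(row_marginals * col_marginals)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 3))
```

Here 70% of the cases agree, but 50% would agree by chance alone, so kappa credits only the agreement beyond that baseline.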

In [None]:
# Visualized another way
# Create a count of true/false predictions
comparison_counts = df.groupby(['autism', 'Class/ASD']).size().reset_index(name='count')


# Replace "yes" and "no" in the 'autism' column with the desired labels
label_map_autism = {'yes': 'autistic', 'no': 'not autistic'}
comparison_counts['autism'] = comparison_counts['autism'].map(label_map_autism)

# Replace "yes" and "no" in the 'Class/ASD' column with 'autistic' and 'not autistic'
label_map_class = {'yes': 'autistic', 'no': 'not autistic'}
comparison_counts['Class/ASD'] = comparison_counts['Class/ASD'].map(label_map_class)



# Create a bar plot
plt.figure(figsize=(8, 5))
sns.barplot(x='autism', y='count', hue='Class/ASD', data=comparison_counts)
plt.title('Comparison of True (Autism) and Predicted (Class/ASD)')
plt.xlabel('Type of patient (from actual diagnosis)')
plt.ylabel('Count')
plt.legend(title='Questionnaire Prediction')
plt.show()


It can be seen now that the questionnaire classifies autistic patients more accurately than non-autistic ones. The questionnaire is also only slightly better than random. A larger dataset would help, but this is what we have to work with. Let's see if we can at least improve the predictions for non-autistic patients.
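
One way to put numbers on "better at autistic than non-autistic" is to compute sensitivity and specificity separately. This is a minimal sketch on made-up rows that reuse the column names from this notebook; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same two columns compared above
toy = pd.DataFrame({
    'autism':    ['yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no'],
    'Class/ASD': ['yes', 'yes', 'no', 'no', 'no', 'no', 'yes', 'yes'],
})

is_autistic = toy['autism'] == 'yes'
flagged = toy['Class/ASD'] == 'yes'

# Sensitivity: share of autistic respondents the questionnaire flags
sensitivity = (is_autistic & flagged).sum() / is_autistic.sum()

# Specificity: share of non-autistic respondents it correctly does not flag
specificity = (~is_autistic & ~flagged).sum() / (~is_autistic).sum()

print(f"Sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}")
```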

<b>Part 3. Building a better model</b>

Now let's see if we can increase the percentage of correctly classified patients using the dataframe we cleaned.

In [None]:
# Show the dataframe again for convenience
display(df.head())
display(df.columns)
display(df.dtypes)

Let's determine what features to use in our model. Let's go one by one, starting with ethnicity.

In [None]:
# A visualization might help.

# Create a count plot
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='ethnicity', hue='autism', hue_order=['no', 'yes'])

# Add title and labels
plt.title('Distribution of Ethnicity by Autism Status', fontsize=16)
plt.xlabel('Ethnicity', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.legend(title='Autism', loc='upper right', labels=['No', 'Yes'])
plt.xticks(rotation=45)
plt.tight_layout()

# Show the plot
plt.show()


It may appear that Latinos have a higher rate of autism, but that group is a small sample, so the difference could easily be noise. From this plot alone, ethnicity does not show a clear association with autism. What about jaundice?
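
Counts in a bar chart are hard to compare across groups of different sizes, so it can help to tabulate the group size next to the rate. Here is a minimal sketch on hypothetical rows; the column names match this notebook, but the values are invented.

```python
import pandas as pd

# Hypothetical toy data: a small group and a slightly larger one
toy = pd.DataFrame({
    'ethnicity': ['Latino', 'Latino', 'Asian', 'Asian', 'Asian', 'Asian'],
    'autism':    ['yes', 'no', 'no', 'no', 'no', 'yes'],
})

# Group size and share of 'yes' per ethnicity, side by side
summary = toy.groupby('ethnicity')['autism'].agg(
    n='size',
    autism_rate=lambda s: (s == 'yes').mean(),
)
print(summary)
```

A high rate next to a very small `n` is a cue to treat the difference with caution.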

In [None]:
# Create a count plot
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='jaundice', hue='autism', hue_order=['no', 'yes'])

# Add title and labels
plt.title('Jaundice Status Among Individuals with Autism', fontsize=16)
plt.xlabel('Jaundice', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.legend(title='Autism', loc='upper right', labels=['No', 'Yes'])
plt.xticks(rotation=0)
plt.tight_layout()

# Show the plot
plt.show()


It may be that those with autism tend to have had jaundice. But again, the sample is too small to call this much of an association. Let's try age.

In [None]:
# Define the age bins
bins = range(0, df['age'].max() + 10, 10)  # Adjust based on max age
labels = [f"{i}-{i+9}" for i in bins[:-1]]  # Create labels for the bins

# Create a count plot
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x=pd.cut(df['age'], bins=bins, labels=labels, right=False), hue='autism', hue_order=['no', 'yes'])

# Add title and labels
plt.title('Distribution of Autism by Age Groups', fontsize=16)
plt.xlabel('Age Group', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.legend(title='Autism', loc='upper right', labels=['No', 'Yes'])
plt.xticks(rotation=45)
plt.tight_layout()

# Show the plot
plt.show()



The 30-39 age group seems to have a higher rate of autism, but again this is a small sample size. What about gender?

In [None]:
# Create a count plot for gender vs autism
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='gender', hue='autism', hue_order=['no', 'yes'])

# Add title and labels
plt.title('Distribution of Autism by Gender', fontsize=16)
plt.xlabel('Gender', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.legend(title='Autism', loc='upper right', labels=['No', 'Yes'])
plt.xticks(rotation=0)
plt.tight_layout()

# Show the plot
plt.show()


As shown, females appear to have a higher rate of autism. Now let's analyze each question.

In [None]:
# List of questions
questions = ['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 
             'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score']

# Initialize a list to store, for each question, the percent of
# autistic respondents who scored 1
percent_correct = []

# Loop through each question and calculate that percent
autistic = df[df['autism'] == 'yes']
for question in questions:
    percent = (autistic[question] == 1).mean() * 100
    percent_correct.append(percent)  # Append the result to the list

# Store these as a new DataFrame
most_predictive_questions = pd.DataFrame(
    {'Question': questions,
     "autistic_percent_answered_'1'": percent_correct
    }
).sort_values("autistic_percent_answered_'1'", ascending=False)

# Display the DataFrame
most_predictive_questions



Shown above are the questions sorted in descending order by the percentage of autistic respondents who scored 1, which makes the top questions potentially the most predictive. Note that per the key, a score of 1 can mean either agreeing or disagreeing, depending on the question.
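
A question that nearly everyone scores 1 on would also rank high in that table, so it may be worth comparing the two groups directly. Here is a sketch of that comparison on hypothetical rows; the `difference` column is my own addition for illustration.

```python
import pandas as pd

# Hypothetical toy data: A1 is scored 1 by almost everyone, A2 is not
toy = pd.DataFrame({
    'autism':   ['yes', 'yes', 'no', 'no', 'no', 'no'],
    'A1_Score': [1, 1, 1, 1, 1, 0],
    'A2_Score': [1, 0, 0, 0, 0, 0],
})

# Percent scoring 1 in each group, one row per question
rates = toy.groupby('autism')[['A1_Score', 'A2_Score']].mean().T * 100

# Gap between the autistic and non-autistic groups
rates['difference'] = rates['yes'] - rates['no']
rates = rates.sort_values('difference', ascending=False)
print(rates)
```

In this toy example, A1 has the higher percentage among autistic respondents, yet A2 separates the two groups better.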


Given everything we've learned about this dataset, let's try some modeling, starting with logistic regression.

In [None]:
# Import some libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# Let's encode some features as binary
df['jaundice_binary'] = df['jaundice'].map({'yes': 1, 'no': 0})
df['autism_binary'] = df['autism'].map({'yes': 1, 'no': 0})
df['is_female'] = df['gender'].map({'f': 1, 'm': 0})
df['is_male'] = df['gender'].map({'m': 1, 'f': 0})
df['class_binary'] = df['Class/ASD'].map({'yes': 1, 'no':0})

# According to the key, scores greater than 6 are suggested to
# be referred to a health professional
df['score_more_than_6'] = (df['result'] > 6).astype(int)

In [None]:
# Create binary columns for each specific age bin
df['is_10_to_19_yrs_old'] = ((df['age'] >= 10) & (df['age'] <= 19)).astype(int)
df['is_20_to_29_yrs_old'] = ((df['age'] >= 20) & (df['age'] <= 29)).astype(int)
df['is_30_to_39_yrs_old'] = ((df['age'] >= 30) & (df['age'] <= 39)).astype(int)
df['is_40_to_49_yrs_old'] = ((df['age'] >= 40) & (df['age'] <= 49)).astype(int)
df['is_50_to_59_yrs_old'] = ((df['age'] >= 50) & (df['age'] <= 59)).astype(int)
df['is_60_to_69_yrs_old'] = ((df['age'] >= 60) & (df['age'] <= 69)).astype(int)

In [None]:
# One-hot encode the ethnicities and store them back in df
df = pd.get_dummies(df, columns=['ethnicity'], prefix='', prefix_sep='')

In [None]:
# Check out the dataframe
display(df.head())
display(df.columns)

In [None]:
# It looks like the Middle Eastern column name has a trailing space
df.rename(columns={'Middle Eastern ': 'Middle Eastern'}, inplace=True)

Now, let's use feature selection to determine what features have the most potential predictive power.

In [None]:
# List out all binary features
features = ['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score',
            'A5_Score', 'A6_Score','A7_Score', 'A8_Score',
            'A9_Score', 'A10_Score','jaundice_binary',
            'is_female', 'is_male',
            'score_more_than_6','is_10_to_19_yrs_old',
            'is_20_to_29_yrs_old', 'is_30_to_39_yrs_old',
            'is_40_to_49_yrs_old', 'is_50_to_59_yrs_old',
            'is_60_to_69_yrs_old','Asian', 'Black',
            'Hispanic', 'Latino', 'Middle Eastern',
            'Others','Pasifika', 'South Asian', 'Turkish',
            'White-European']

In [None]:
# Let's try chi-squared test

In [None]:
from sklearn.feature_selection import RFE

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Build the feature matrix and target from the encoded columns
X = df[features]
y = df['autism_binary']

# Score every feature against the target (k='all' keeps all of them)
selector = SelectKBest(score_func=chi2, k='all')
selector.fit(X, y)

# Get the scores and features (with k='all', this is every column)
chi2_scores = selector.scores_
features = X.columns[selector.get_support()]

# Display feature scores
for feature, score in zip(X.columns, chi2_scores):
    print(f"{feature}: {score}")


It looks like we should add more features. Let's try to get our model closer to, and ideally greater than (in terms of absolute value), the original kappa stat.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(estimator=model, n_features_to_select=5)  # Choose the number of features to keep
rfe.fit(X, y)

selected_features = X.columns[rfe.support_]
print(f"Selected features: {selected_features}")


In [None]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Train a Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X, y)

# Get feature importances
importances = rf_model.feature_importances_

# Create a DataFrame for better visualization
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importances.sort_values(by='Importance', ascending=False, inplace=True)

print(feature_importances)


In [None]:
# Let's see how these features do.
features = ['jaundice_binary','is_female', 'is_male', 'score_more_than_6',
       'is_10_to_19_yrs_old', 'is_20_to_29_yrs_old', 'is_30_to_39_yrs_old',
       'is_40_to_49_yrs_old', 'is_50_to_59_yrs_old', 'is_60_to_69_yrs_old',]
target = 'autism_binary'

# Prepare the features and target variable
X = df[features]
y = df[target] 


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Create confusion matrix for logistic regression
conf_matrix = confusion_matrix(y_test, y_pred)

# Display confusion matrix for logistic regression
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No', 'Predicted Yes'], 
            yticklabels=['True No', 'True Yes'])  # Consistent labeling
plt.title('Confusion Matrix for Autism Prediction (Logistic Regression)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print the confusion matrix for logistic regression
print("Logistic Regression Confusion Matrix:\n", conf_matrix)
print("Kappa stat:", cohen_kappa_score(y_test, y_pred))

# Create confusion matrix for original data
conf_matrix_original = confusion_matrix(df['autism'], df['Class/ASD'], labels=['no', 'yes'])

# Display confusion matrix for original data
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_original, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No', 'Predicted Yes'], 
            yticklabels=['True No', 'True Yes'])  # Consistent labeling
plt.title('Confusion Matrix between Actual and Predicted Autism from Questionnaire')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print the confusion matrix for original data
print("Original Confusion Matrix:\n", conf_matrix_original)
print("Original kappa stat:", questionnaire_kappa_stat)
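
Note that the model's kappa above comes from a single 80/20 split, which can swing quite a bit on a dataset this small. One option is to average kappa over several stratified folds. Here is a sketch on synthetic data: `X_toy` and `y_toy` are made up, and the 5-fold setup is just an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary features; the target follows the first feature
# with roughly 20% of the labels flipped
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(200, 5))
noise = rng.random(200) < 0.2
y_toy = np.where(noise, 1 - X_toy[:, 0], X_toy[:, 0])

# Wrap Cohen's kappa so cross_val_score can use it as the metric
kappa_scorer = make_scorer(cohen_kappa_score)

# Stratified folds keep the yes/no balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring=kappa_scorer
)
print("Kappa per fold:", np.round(scores, 3))
print("Mean kappa:", scores.mean(), "Std:", scores.std())
```

The spread across folds gives a rough sense of how much a single split's kappa can be trusted.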


In [None]:
# Now use the individual question scores plus the demographic features.
features = ['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 
            'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score',
            'jaundice_binary','is_30_to_39_yrs_old', 'is_female', 'is_male',
            'Asian', 'Black','Hispanic', 'Latino', 'Middle Eastern',
            'Others', 'Pasifika','South Asian', 'Turkish',
            'White-European']
target = 'autism_binary'

# Prepare the features and target variable
# (df already holds the one-hot encoded ethnicity columns)
X = df[features]
y = df[target]



# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Create confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Display confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No', 'Predicted Yes'], 
            yticklabels=['True No', 'True Yes'])
plt.title('Confusion Matrix for Autism Prediction')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print the confusion matrix
print("Confusion Matrix:\n", conf_matrix)
print("Kappa stat:", cohen_kappa_score(y_test,y_pred))
print("Questionnaire kappa stat:", questionnaire_kappa_stat)

In [None]:
# Now use the questionnaire's own classification plus the demographic features.
features = ['class_binary','jaundice_binary', 'is_female', 'is_male',
            'Asian', 'Black','Hispanic', 'Latino', 'Middle Eastern',
            'Others', 'Pasifika','South Asian', 'Turkish',
            'White-European']
target = 'autism_binary'

# Prepare the features and target variable
# (df already holds the one-hot encoded ethnicity columns)
X = df[features]
y = df[target]



# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Create confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Display confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No', 'Predicted Yes'], 
            yticklabels=['True No', 'True Yes'])
plt.title('Confusion Matrix for Autism Prediction')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print the confusion matrix
print("Confusion Matrix:\n", conf_matrix)
print("Kappa stat:", cohen_kappa_score(y_test,y_pred))
print("Questionnaire kappa stat:", questionnaire_kappa_stat)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns

# Features and target from the dataset
features = ['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 
            'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score',
            'jaundice_binary', 'is_female', 'is_male', 'Asian', 'Black','Hispanic', 'Latino', 'Middle Eastern',
            'Others', 'Pasifika','South Asian', 'Turkish',
            'White-European', 'is_30_to_39_yrs_old']
target = 'autism_binary'

# Prepare the features and target variable
# (df already holds the one-hot encoded ethnicity columns)
X = df[features]
y = df[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the Decision Tree model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# Make predictions
y_pred = tree_model.predict(X_test)

# Create confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Display confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No', 'Predicted Yes'], 
            yticklabels=['True No', 'True Yes'])
plt.title('Confusion Matrix for Autism Prediction (Decision Tree)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Questionnaire kappa stat:", questionnaire_kappa_stat)
print("This Model's kappa stat:", cohen_kappa_score((y_test), (y_pred)))


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns

# Features and target from the dataset (this time without the age feature)
features = ['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 
            'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score',
            'jaundice_binary', 'is_female', 'is_male', 'Asian', 'Black','Hispanic', 'Latino', 'Middle Eastern',
            'Others', 'Pasifika','South Asian', 'Turkish',
            'White-European']
target = 'autism_binary'

# Prepare the features and target variable
# (df already holds the one-hot encoded ethnicity columns)
X = df[features]
y = df[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the Decision Tree model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# Make predictions
y_pred = tree_model.predict(X_test)

# Create confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Display confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No', 'Predicted Yes'], 
            yticklabels=['True No', 'True Yes'])
plt.title('Confusion Matrix for Autism Prediction (Decision Tree)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print the confusion matrix and classification report
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Questionnaire kappa stat:", questionnaire_kappa_stat)
print("This Model's kappa stat:", cohen_kappa_score((y_test), (y_pred)))