In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
df = pd.read_csv('hoog.csv')

In [24]:
df.head(30)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
2,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
3,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
4,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
5,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
6,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
7,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
8,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
9,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C


df.shape

# Assistant
To run the statistical tests you mentioned (t-test, ANOVA, MANOVA, and chi-square) on this dataframe, you can use the following functions from `scipy.stats` and `statsmodels`:

1. **T-test**: You can use the `scipy.stats.ttest_ind` function to perform a t-test to compare the means of two groups, such as comparing the 'Age' of passengers who survived versus those who did not.

2. **ANOVA**: The `scipy.stats.f_oneway` function can be used to perform a one-way ANOVA to compare the means of 'Age' across multiple levels of the 'Pclass' variable.

3. **MANOVA**: MANOVA applies when you have two or more dependent variables (for example 'Age' and 'Fare') and want to test whether their means jointly differ across groups. `scipy.stats` does not provide it; in Python you can use the `MANOVA` class in `statsmodels.multivariate.manova`, usually through its formula interface. You need to decide which columns are the dependent variables and which are the grouping (independent) variables before running it.

4. **Chi-square test**: You can use the `scipy.stats.chi2_contingency` function to perform a chi-square test of independence to determine if there is a significant association between two categorical variables, such as 'Survived' and 'Sex'.
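
As a rough illustration of item 2, the sketch below runs a one-way ANOVA of 'Age' across 'Pclass'. It uses a small synthetic frame called `demo` (an assumption, standing in for `df` with the same column names), so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the notebook's df (assumption: same column names)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Pclass": rng.choice([1, 2, 3], size=150),
    "Age": rng.normal(35, 12, size=150),
})
demo.loc[::10, "Age"] = np.nan  # mimic missing ages

# Build one array of non-missing ages per passenger class
age_groups = [g.dropna().to_numpy() for _, g in demo.groupby("Pclass")["Age"]]

# f_oneway takes each group as a separate positional argument
f_stat, p_value = stats.f_oneway(*age_groups)
print("F-statistic:", f_stat)
print("P-value:", p_value)
```

Grouping with `groupby` avoids hard-coding the class labels, which helps if a class is absent from a subset of the data.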

If you have specific hypotheses or comparisons in mind, please let me know and I can provide you with example code for conducting these tests on the given dataframe.

# Assistant
It looks like you're interested in statistical tests commonly used in data analysis. Here's a brief overview of each test:

1. **T-test**: The t-test is used to determine if there is a significant difference between the means of two groups. It is commonly used to compare the means of a continuous variable between two groups.

2. **ANOVA (Analysis of Variance)**: One-way ANOVA tests whether the mean of a continuous variable differs across the levels of a categorical variable. It is typically used with three or more groups; with exactly two groups it is equivalent to the equal-variance t-test.

3. **MANOVA (Multivariate Analysis of Variance)**: MANOVA is an extension of ANOVA that allows for the comparison of means on multiple dependent variables simultaneously. It is used when there are two or more dependent variables.

4. **Chi-square test**: The chi-square test is used to determine if there is a significant association between two categorical variables. It is often used to test for independence between two variables.

In Python, these tests can be performed using the `scipy.stats` module for t-test, ANOVA, and chi-square test. For MANOVA, the `statsmodels` package provides functionality for conducting multivariate analysis of variance.
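
A minimal sketch of the MANOVA case, assuming `statsmodels` is installed. The frame `demo` is synthetic and only borrows the column names 'Age', 'Fare', and 'Pclass'; the choice of those two dependent variables is an illustrative assumption, not something the notebook settles.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Synthetic stand-in data (assumption: column names borrowed from df)
rng = np.random.default_rng(1)
n = 120
demo = pd.DataFrame({
    "Pclass": rng.choice([1, 2, 3], size=n),
    "Age": rng.normal(35, 12, size=n),
    "Fare": rng.gamma(2.0, 20.0, size=n),
})

# Two dependent variables on the left, one categorical factor on the right
manova = MANOVA.from_formula("Age + Fare ~ C(Pclass)", data=demo)
result = manova.mv_test()

# The summary reports Wilks' lambda, Pillai's trace, and related statistics
print(result)
```

On the real dataframe, rows with missing 'Age' or 'Fare' would need to be handled first, for example with `df.dropna(subset=['Age', 'Fare'])`.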

If you have specific data and would like to see an example of how to perform any of these tests, please let me know!

In [29]:
import pandas as pd
import scipy.stats as stats

# Assuming 'df' is the dataframe containing the given information

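# T-test: compare mean Age of survivors vs. non-survivors
# (ttest_ind assumes equal variances by default; equal_var=False gives Welch's test)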
survived_age = df[df['Survived'] == 1]['Age'].dropna()
not_survived_age = df[df['Survived'] == 0]['Age'].dropna()
t_stat, p_value = stats.ttest_ind(survived_age, not_survived_age)
print("T-test - Mean Age of Survived vs. Not Survived:")
print("T-statistic:", t_stat)
print("P-value:", p_value)

# One-way ANOVA: compare mean Fare across the three passenger classes
class1_fare = df[df['Pclass'] == 1]['Fare'].dropna()
class2_fare = df[df['Pclass'] == 2]['Fare'].dropna()
class3_fare = df[df['Pclass'] == 3]['Fare'].dropna()
f_stat, p_value = stats.f_oneway(class1_fare, class2_fare, class3_fare)
print("\nANOVA - Mean Fare across Passenger Classes:")
print("F-statistic:", f_stat)
print("P-value:", p_value)

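# Chi-square test of independence between Survived and Sex
# (for a 2x2 table, chi2_contingency applies Yates' continuity correction by default)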
contingency_table = pd.crosstab(df['Survived'], df['Sex'])
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print("\nChi-square Test - Association between Survived and Sex:")
print("Chi-square:", chi2)
print("P-value:", p)

T-test - Mean Age of Survived vs. Not Survived:
T-statistic: -3.53435125095576
P-value: 0.0005189503307881672

ANOVA - Mean Fare across Passenger Classes:
F-statistic: 11.049449306463396
P-value: 2.9769411802232758e-05

Chi-square Test - Association between Survived and Sex:
Chi-square: 49.62965365707783
P-value: 1.8568580662867508e-12
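
Two follow-up checks are often worth running alongside results like these. The sketch below shows them on a synthetic frame `demo` (an assumption, not the notebook's data): Welch's t-test, which drops the equal-variance assumption, and a look at the expected counts that `chi2_contingency` returns, since the chi-square approximation is usually considered unreliable when expected counts are small (a common rule of thumb is below 5).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (assumption: column names borrowed from df)
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "Survived": rng.choice([0, 1], size=200),
    "Sex": rng.choice(["female", "male"], size=200),
    "Age": rng.normal(35, 12, size=200),
})

survived_age = demo.loc[demo["Survived"] == 1, "Age"]
not_survived_age = demo.loc[demo["Survived"] == 0, "Age"]

# Welch's t-test: does not assume the two groups share a variance
t_stat, p_welch = stats.ttest_ind(survived_age, not_survived_age, equal_var=False)
print("Welch t-statistic:", t_stat)
print("P-value:", p_welch)

# Inspect expected counts under independence before trusting the chi-square p-value
table = pd.crosstab(demo["Survived"], demo["Sex"])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
min_expected = expected.min()
print("Degrees of freedom:", dof)
print("Smallest expected count:", min_expected)
```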


In [44]:
pd.CategoricalIndex(df)

CategoricalIndex(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
                  'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
                 categories=['Age', 'Cabin', 'Embarked', 'Fare', ..., 'Sex', 'SibSp', 'Survived', 'Ticket'], ordered=False, dtype='category')

In [46]:
pd.Categorical(df)

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', ..., 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
Length: 12
Categories (12, object): ['Age', 'Cabin', 'Embarked', 'Fare', ..., 'Sex', 'SibSp', 'Survived', 'Ticket']
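
Note that the two cells above pass the whole DataFrame to `pd.CategoricalIndex` and `pd.Categorical`. Iterating a DataFrame yields its column labels, so the result categorizes the 12 column names and not the data. If the goal is to treat columns such as 'Pclass', 'Sex', and 'Embarked' as categorical, one possible approach is `astype`, sketched here on a tiny hand-made frame `demo` (hypothetical values, same column names):

```python
import pandas as pd

# Tiny hand-made frame (hypothetical values; column names borrowed from df)
demo = pd.DataFrame({
    "Pclass": [1, 3, 2, 1],
    "Sex": ["female", "male", "male", "female"],
    "Embarked": ["C", "S", "S", "Q"],
    "Age": [38.0, 22.0, 54.0, 35.0],
})

# Convert only the chosen columns; the others keep their dtype
cat_cols = ["Pclass", "Sex", "Embarked"]
demo = demo.astype({col: "category" for col in cat_cols})

print(demo.dtypes)
print(demo["Sex"].cat.categories)
```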

In [48]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')