In [1]:
import pandas as pd

# Load the data from the uploaded CSV file
data_path = 'clean_FineTech_appData.csv'
customer_data = pd.read_csv(data_path)

# Display the first few rows of the dataframe and the data types of each column
customer_data.head(), customer_data.dtypes


(   Unnamed: 0    user  dayofweek  hour  age  numscreens  minigame  \
 0           0  235136          3     2   23          15         0   
 1           1  333588          6     1   24          13         0   
 2           2  254414          1    19   23           3         0   
 3           3  234192          4    16   28          40         0   
 4           4   51549          1    18   31          32         0   
 
    used_premium_feature  enrolled  liked  ...  SecurityModal  ResendToken  \
 0                     0         0      0  ...              0            0   
 1                     0         0      0  ...              0            0   
 2                     1         0      1  ...              0            0   
 3                     0         1      0  ...              0            0   
 4                     0         1      1  ...              0            0   
 
    TransactionList  NetworkFailure  ListPicker  remain_screen_list  \
 0                0               0  

In [2]:
from scipy.stats import chi2_contingency

# Identify all binary columns
binary_columns = [col for col in customer_data.columns if customer_data[col].nunique() == 2]

# Create a dictionary to store p-values from chi-square tests
chi_square_results = {}

for col in binary_columns:
    # Constructing the contingency table
    contingency_table = pd.crosstab(customer_data[col], customer_data['enrolled'])
    # Perform the Chi-square test
    chi2, p, dof, ex = chi2_contingency(contingency_table)
    # Store the p-value corresponding to the test
    chi_square_results[col] = p

# Filter results to show only those variables with significant p-values (p < 0.05)
significant_chi_square_results = {k: v for k, v in chi_square_results.items() if v < 0.05}
significant_chi_square_results

{'minigame': 0.022415479101004665,
 'used_premium_feature': 1.0917913817684498e-26,
 'enrolled': 0.0,
 'location': 0.0,
 'Institutions': 0.04719787373062467,
 'VerifyPhone': 0.0,
 'BankVerification': 0.0,
 'VerifyDateOfBirth': 0.0,
 'ProfilePage': 2.2607178205019624e-09,
 'VerifyCountry': 0.0,
 'Cycle': 1.2348732488788239e-92,
 'idscreen': 0.0,
 'Splash': 1.360676606225607e-17,
 'RewardsContainer': 0.015451647138783517,
 'Finances': 1.0197685730629454e-28,
 'Alerts': 5.958678697692519e-225,
 'Leaderboard': 0.00475562326061741,
 'VerifyMobile': 1.751896295185995e-257,
 'VerifyHousing': 1.300135850208171e-18,
 'RewardDetail': 2.486167321485173e-06,
 'VerifyHousingAmount': 4.873439290375567e-20,
 'ProfileMaritalStatus': 0.001279298998704523,
 'ProfileEducation': 8.343952896849845e-06,
 'ProfileEducationMajor': 1.2842006994438508e-07,
 'Rewards': 1.4935015108085087e-15,
 'AccountView': 1.0093294904600182e-48,
 'VerifyAnnualIncome': 2.98679005829175e-21,
 'VerifyIncomeType': 1.0486393991598

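The Chi-square test results show that many binary variables are significantly associated with enrollment status (p-value < 0.05). The `enrolled` entry in the output is the target tested against itself, so its p-value of 0.0 is trivial and should be ignored.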

Some of these significant variables include:

Usage of premium features (used_premium_feature)
Verification of phone (VerifyPhone), date of birth (VerifyDateOfBirth), and other personal information
Engagement with different app features like minigame, alerts, finances, and more


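This suggests these features are related to whether a user enrolls. Each significant result indicates a statistical dependency between the feature and enrollment status, not a causal effect. Two cautions apply: with this many tests run at the 0.05 level, a few marginal results such as Institutions (p = 0.047) or minigame (p = 0.022) may be false positives, and with roughly 50,000 rows even very weak associations reach significance.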
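Because a small p-value on a large sample says little about how strong an association is, an effect size such as Cramér's V can be reported next to each Chi-square test. The sketch below is illustrative only: `cramers_v` is a hypothetical helper, and the `toy` frame with its `weak_flag` and `strong_flag` columns is synthetic data, not the FineTech dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V for two categorical series (hypothetical helper)."""
    table = pd.crosstab(x, y)
    # correction=False returns the plain Pearson statistic that V is defined on
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Synthetic stand-in for customer_data; all values are made up
rng = np.random.default_rng(0)
n = 50_000
enrolled = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "enrolled": enrolled,
    # copies `enrolled` 4% of the time, otherwise random: a weak association
    "weak_flag": np.where(rng.random(n) < 0.04, enrolled, rng.integers(0, 2, size=n)),
    # copies `enrolled` 60% of the time: a strong association
    "strong_flag": np.where(rng.random(n) < 0.60, enrolled, rng.integers(0, 2, size=n)),
})

v_weak = cramers_v(toy["weak_flag"], toy["enrolled"])
v_strong = cramers_v(toy["strong_flag"], toy["enrolled"])
print(f"weak_flag   V = {v_weak:.3f}")
print(f"strong_flag V = {v_strong:.3f}")
```

On a sample this large both flags would typically come out significant, while V separates the weak association from the strong one.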

In [3]:
from scipy.stats import mannwhitneyu

# Identify numeric columns (excluding binary variables and the user identifier)
numeric_columns = [col for col in customer_data.columns if customer_data[col].nunique() > 2 and col != 'user']

# Create a dictionary to store p-values from Mann-Whitney U tests
mann_whitney_results = {}

for col in numeric_columns:
    # Split data into two groups: enrolled and not enrolled
    group1 = customer_data[customer_data['enrolled'] == 0][col]
    group2 = customer_data[customer_data['enrolled'] == 1][col]
    
    # Perform the Mann-Whitney U test
    stat, p = mannwhitneyu(group1, group2, alternative='two-sided')
    mann_whitney_results[col] = p

# Filter results to show only those variables with significant p-values (p < 0.05)
significant_mann_whitney_results = {k: v for k, v in mann_whitney_results.items() if v < 0.05}
significant_mann_whitney_results


{'dayofweek': 0.0012067335547983544,
 'hour': 1.579299279813777e-30,
 'age': 2.3910540280338208e-262,
 'numscreens': 0.0,
 'remain_screen_list': 0.0,
 'saving_screens_count': 1.4227613006168807e-06,
 'credit_screens_count': 0.0,
 'cc_screens_count': 0.00035195203915489194,
 'loan_screens_count': 1.0801980033361591e-138}

The Mann-Whitney U test results also reveal several numeric variables with statistically significant differences between the enrolled and not-enrolled users:

dayofweek and hour of app usage
age of the user
Number of screens (numscreens) interacted with
Various counts of specific screens like credit_screens_count, cc_screens_count, and loan_screens_count

These findings suggest that these variables might be useful predictors of enrollment. For instance, differences in age and in the extent of app interaction (numscreens) could help predict enrollment. Because dayofweek and hour are cyclic codes rather than quantities, a rank-based comparison of them is hard to interpret; treating them as categories is more natural.
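The Mann-Whitney U statistic can be turned into an effect size that also shows direction. A minimal sketch, assuming SciPy 1.7 or later (where `statistic` is the U value for the first sample) and using synthetic screen counts rather than the real data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Synthetic screen counts for the two groups; values are made up
not_enrolled = rng.poisson(lam=18, size=2000)
enrolled = rng.poisson(lam=22, size=2000)

res = mannwhitneyu(enrolled, not_enrolled, alternative="two-sided")
u1 = res.statistic  # U for the first sample passed in (enrolled)
n1, n2 = len(enrolled), len(not_enrolled)

# Share of (enrolled, not-enrolled) pairs in which the enrolled value is
# larger, with ties counted as one half
prob_superiority = u1 / (n1 * n2)
# Rank-biserial correlation in [-1, 1]; its sign gives the direction
rank_biserial = 2 * prob_superiority - 1

print(f"p-value               = {res.pvalue:.3g}")
print(f"P(enrolled > not)     = {prob_superiority:.3f}")
print(f"rank-biserial r       = {rank_biserial:.3f}")
```

A value of `prob_superiority` near 0.5 means the groups overlap almost completely even when the p-value is tiny.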

In [4]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Categorical variables for ANOVA (dayofweek has 7 levels; the two Profile* columns are two-level flags)
categorical_columns = ['dayofweek', 'ProfileMaritalStatus', 'ProfileEducation']

# Dictionary to store ANOVA results
anova_results = {}

# Perform ANOVA for each categorical variable
for col in categorical_columns:
    # Prepare the formula like 'enrolled ~ C(variable)'
    formula = f'enrolled ~ C({col})'
    model = ols(formula, data=customer_data).fit()
    aov_table = sm.stats.anova_lm(model, typ=2)
    anova_results[col] = aov_table

anova_results


{'dayofweek':                     sum_sq       df         F    PR(>F)
 C(dayofweek)      7.275738      6.0  4.852812  0.000058
 Residual      12492.274262  49993.0       NaN       NaN,
 'ProfileMaritalStatus':                                sum_sq       df          F    PR(>F)
 C(ProfileMaritalStatus)      2.626892      1.0  10.509735  0.001188
 Residual                 12496.923108  49998.0        NaN       NaN,
 'ProfileEducation':                            sum_sq       df          F    PR(>F)
 C(ProfileEducation)      5.012097      1.0  20.056349  0.000008
 Residual             12494.537903  49998.0        NaN       NaN}

The ANOVA results for dayofweek, ProfileMaritalStatus, and ProfileEducation are as follows:

Day of Week: There is a statistically significant difference in enrollment across different days of the week (p = 0.000058).

Marital Status: There is a statistically significant difference in enrollment between the two levels of ProfileMaritalStatus (p = 0.001188; df = 1, so this is a binary flag).

Education Level: There is a statistically significant difference in enrollment between the two levels of ProfileEducation (p = 0.000008; also a binary flag).

These results indicate that there are significant variations in enrollment likelihood across different categories within these variables.
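Since `enrolled` is a 0/1 outcome, a Chi-square test of independence on the category-by-enrollment table is a common alternative to ANOVA, and the per-category enrollment rates show where the differences lie. The sketch below uses a synthetic `toy` frame with made-up probabilities; it is not the notebook's data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
n = 20_000
day = rng.integers(0, 7, size=n)
# Made-up enrollment probabilities: slightly higher on days 5 and 6
p_enroll = np.where(day >= 5, 0.56, 0.50)
toy = pd.DataFrame({
    "dayofweek": day,
    "enrolled": (rng.random(n) < p_enroll).astype(int),
})

# Enrollment rate within each category
rates = toy.groupby("dayofweek")["enrolled"].mean()

# Test of independence on the 7 x 2 contingency table
table = pd.crosstab(toy["dayofweek"], toy["enrolled"])
chi2, p, dof, expected = chi2_contingency(table)

print(rates.round(3))
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
```

The degrees of freedom here are (7 - 1) x (2 - 1) = 6, matching the 6 degrees of freedom reported for `C(dayofweek)` in the ANOVA table.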

In [7]:
from scipy.stats import ttest_ind

# Split users into two groups at age 25
group_25_and_under = customer_data[customer_data['age'] <= 25]['enrolled']
group_over_25 = customer_data[customer_data['age'] > 25]['enrolled']

# Perform Welch's t-test (unequal variances) between the two age groups
age_ttest_result = ttest_ind(group_25_and_under, group_over_25, equal_var=False)
age_ttest_result


TtestResult(statistic=27.273807877678276, pvalue=4.620007038446122e-162, df=34399.642878622784)

The Welch t-test compares enrollment rates between users aged 25 or younger and users older than 25 (the code splits at `age <= 25`). The difference is statistically significant (t ≈ 27.27, p ≈ 4.6e-162). Because the 25-and-under group is passed first and the statistic is positive, that group has the higher enrollment rate.
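To report how large the gap between two groups is, the difference in enrollment rates can be given with a confidence interval. A minimal sketch, assuming a normal-approximation (Wald) interval; `rate_difference_ci` is a hypothetical helper and the two groups are synthetic, with made-up rates.

```python
import numpy as np


def rate_difference_ci(group_a, group_b, z=1.96):
    """Difference in rates (a - b) with a Wald interval.

    Hypothetical helper; z=1.96 gives roughly 95% coverage.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    p_a, p_b = a.mean(), b.mean()
    se = np.sqrt(p_a * (1 - p_a) / a.size + p_b * (1 - p_b) / b.size)
    diff = p_a - p_b
    return diff, diff - z * se, diff + z * se


rng = np.random.default_rng(3)
# Synthetic 0/1 enrollment flags for two age groups; rates are made up
younger = (rng.random(15_000) < 0.58).astype(int)
older = (rng.random(35_000) < 0.46).astype(int)

diff, low, high = rate_difference_ci(younger, older)
print(f"rate difference = {diff:.3f} (interval {low:.3f} to {high:.3f})")
```

An interval that excludes zero agrees with a significant test, and its width shows how precisely the gap is estimated.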

In [8]:
# Perform Mann-Whitney U test on 'numscreens' to compare between enrolled and not enrolled
numscreens_enrolled = customer_data[customer_data['enrolled'] == 1]['numscreens']
numscreens_not_enrolled = customer_data[customer_data['enrolled'] == 0]['numscreens']

# Calculate the Mann-Whitney U test for 'numscreens'
numscreens_mannwhitney = mannwhitneyu(numscreens_enrolled, numscreens_not_enrolled, alternative='two-sided')
numscreens_mannwhitney


MannwhitneyuResult(statistic=422652951.0, pvalue=0.0)

The Mann-Whitney U test on the numscreens variable shows a p-value of approximately 0.0, indicating a statistically significant difference in the number of screens interacted with between enrolled and not enrolled users. This suggests that the interaction level, measured by the number of screens a user engages with, is significantly associated with enrollment.

In [9]:
# Mann-Whitney U test for financial interaction screens
financial_screens = ['credit_screens_count', 'cc_screens_count', 'loan_screens_count']
financial_screen_results = {}

for screen in financial_screens:
    enrolled = customer_data[customer_data['enrolled'] == 1][screen]
    not_enrolled = customer_data[customer_data['enrolled'] == 0][screen]
    stat, p = mannwhitneyu(enrolled, not_enrolled, alternative='two-sided')
    financial_screen_results[screen] = p

# Chi-square tests for app service usage
app_services = ['used_premium_feature', 'minigame']
app_service_results = {}

for service in app_services:
    contingency_table = pd.crosstab(customer_data[service], customer_data['enrolled'])
    chi2, p, dof, ex = chi2_contingency(contingency_table)
    app_service_results[service] = p

financial_screen_results, app_service_results


({'credit_screens_count': 0.0,
  'cc_screens_count': 0.00035195203915489194,
  'loan_screens_count': 1.0801980033361591e-138},
 {'used_premium_feature': 1.0917913817684498e-26,
  'minigame': 0.022415479101004665})

Financial Interaction Screens

Credit Screens Count: Statistically significant difference between enrolled and not-enrolled users (reported p-value of 0.0, i.e. below floating-point precision).
CC Screens Count (Credit Card): Significant difference (p-value = 0.00035).
Loan Screens Count: Significant difference (p-value ≈ 1.1e-138).
These tests are two-sided, so they establish that the distributions differ but not which group interacts more; that requires comparing group means or medians.

App Services Interaction

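Used Premium Feature: Strong statistical significance (p-value ≈ 1.1e-26). Use of premium features is associated with enrollment status; the Chi-square test alone does not show whether the association is positive or negative.
Minigame: Also statistically significant, though much more weakly (p-value = 0.022). Playing the minigame is associated with enrollment status, with the direction again left open by the test.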
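The direction of each association can be read from a per-group summary. The sketch below reuses two column names from the notebook, but the `toy` frame is synthetic: its values are deliberately constructed so that the enrolled group is lower on both features, to show that a significant two-sided test is compatible with either sign.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 10_000
enrolled = rng.integers(0, 2, size=n)

# Synthetic data; the rates below are made up for illustration
toy = pd.DataFrame({
    "enrolled": enrolled,
    "loan_screens_count": rng.poisson(lam=np.where(enrolled == 1, 0.6, 1.0)),
    "used_premium_feature": (
        rng.random(n) < np.where(enrolled == 1, 0.15, 0.20)
    ).astype(int),
})

# One row per enrollment group: mean and median count, share using the feature
summary = toy.groupby("enrolled").agg(
    loan_mean=("loan_screens_count", "mean"),
    loan_median=("loan_screens_count", "median"),
    premium_rate=("used_premium_feature", "mean"),
)
print(summary.round(3))
```

Running the same `groupby` on `customer_data` would show which group is actually higher for each feature.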

In [10]:
# Employment and income verification analysis using Chi-square test
verification_features = ['VerifyAnnualIncome', 'VerifyIncomeType', 'ProfileEmploymentLength']
verification_results = {}

for feature in verification_features:
    contingency_table = pd.crosstab(customer_data[feature], customer_data['enrolled'])
    chi2, p, dof, ex = chi2_contingency(contingency_table)
    verification_results[feature] = p

verification_results


{'VerifyAnnualIncome': 2.98679005829175e-21,
 'VerifyIncomeType': 1.048639399159856e-07,
 'ProfileEmploymentLength': 0.00020830801764954013}

Verify Annual Income: There is a significant association between VerifyAnnualIncome and enrollment (p-value ≈ 3.0e-21).

Verify Income Type: Also shows a significant association with enrollment (p-value ≈ 1.0e-07).

Profile Employment Length: Significant (p-value = 0.0002), so the ProfileEmploymentLength feature is associated with enrollment status. As with the other Chi-square tests, these results show that an association exists but not whether it raises or lowers the likelihood of enrolling.
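For two-level features, an odds ratio gives both the direction and the size of the association. A minimal sketch: `odds_ratio_2x2` is a hypothetical helper, the `verified` and `enrolled` arrays are synthetic, and the helper assumes no cell of the 2x2 table is empty.

```python
import numpy as np
import pandas as pd


def odds_ratio_2x2(feature: pd.Series, outcome: pd.Series) -> float:
    """Odds of outcome == 1 when feature == 1, relative to feature == 0.

    Hypothetical helper; assumes both inputs are coded 0/1 and that
    no cell of the 2x2 table is zero.
    """
    table = pd.crosstab(feature, outcome).reindex(
        index=[0, 1], columns=[0, 1], fill_value=0
    )
    a, b = table.loc[1, 1], table.loc[1, 0]  # feature present
    c, d = table.loc[0, 1], table.loc[0, 0]  # feature absent
    return float((a * d) / (b * c))


rng = np.random.default_rng(5)
n = 20_000
# Synthetic flag and outcome; the probabilities are made up
verified = (rng.random(n) < 0.30).astype(int)
enrolled = (rng.random(n) < np.where(verified == 1, 0.60, 0.50)).astype(int)

ratio = odds_ratio_2x2(pd.Series(verified), pd.Series(enrolled))
print(f"odds ratio = {ratio:.2f}")
```

A ratio above 1 means the feature goes with higher odds of enrolling, and a ratio below 1 with lower odds.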

# Additional tests 


Hour of App Use: Compare enrollment rates based on the hour of the day using ANOVA.

Number of Likes: Assess if the number of likes a user has given correlates with enrollment using a Mann-Whitney U test.

Location: If detailed location data is available, compare urban vs. rural enrollment using a Chi-square test.

Security Modal Interactions: Test if interaction with security modals impacts enrollment using a Chi-square test.

Network Failure Interactions: Determine if experiencing network failures affects enrollment likelihood using a Chi-square test.

Transaction List Usage: Examine if using the transaction list feature correlates with higher enrollment using a Chi-square test.

Profile Children: Test if having children (if data available) affects enrollment using a Chi-square test.

Profile Job Title: Use ANOVA to see if job titles relate to enrollment rates.

List Picker Usage: Determine if using the list picker feature impacts enrollment using a Chi-square test.

Reward Details Interaction: Analyze if interacting with reward details correlates with enrollment using a Chi-square test.

In [13]:
# Performing multiple statistical tests as outlined

# 1. ANOVA for 'hour' of app use
hour_formula = 'enrolled ~ C(hour)'
hour_model = ols(hour_formula, data=customer_data).fit()
hour_anova = sm.stats.anova_lm(hour_model, typ=2)

# 2. Mann-Whitney U for 'liked'
liked_enrolled = customer_data[customer_data['enrolled'] == 1]['liked']
liked_not_enrolled = customer_data[customer_data['enrolled'] == 0]['liked']
liked_mannwhitney = mannwhitneyu(liked_enrolled, liked_not_enrolled, alternative='two-sided')

# 3. Chi-square tests for five features (Location, Profile Children and Profile Job Title are not tested here)
features_to_test = ['SecurityModal', 'NetworkFailure', 'TransactionList', 
                    'ListPicker', 'RewardDetail']
chi_square_results_multiple = {}

for feature in features_to_test:
    contingency_table = pd.crosstab(customer_data[feature], customer_data['enrolled'])
    chi2, p, dof, ex = chi2_contingency(contingency_table)
    chi_square_results_multiple[feature] = p

# Output results from performed tests
hour_anova, liked_mannwhitney, chi_square_results_multiple


(                sum_sq       df          F        PR(>F)
 C(hour)     115.333946     23.0  20.235889  1.434435e-83
 Residual  12384.216054  49976.0        NaN           NaN,
 MannwhitneyuResult(statistic=312082500.0, pvalue=0.6953715083625313),
 {'SecurityModal': 2.1675154116171901e-13,
  'NetworkFailure': 0.06997705089803223,
  'TransactionList': 1.289329495300392e-25,
  'ListPicker': 6.128522405763001e-05,
  'RewardDetail': 2.486167321485173e-06})

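The cell above runs three groups of tests. Location, Profile Children, and Profile Job Title from the proposed list are not tested.

1. **ANOVA for `hour` of app use**
2. **Mann-Whitney U test for `liked`**
3. **Chi-square tests for several features** including `SecurityModal`, `NetworkFailure`, `TransactionList`, `ListPicker`, and `RewardDetail`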


### ANOVA for Hour of App Use
- **F-statistic**: 20.24, **p-value**: \(\approx 1.4 \times 10^{-83}\)
  - Enrollment rates differ significantly across hours of the day, so the time at which the app is used is associated with the likelihood of enrolling.

### Mann-Whitney U Test for Likes
- **Statistic**: 312082500.0, **p-value**: 0.695
  - The test finds no significant difference in `liked` between enrolled and not-enrolled users, so liking behavior is unlikely to be a strong predictor of enrollment on its own.

### Chi-Square Tests for Various Features
- **Security Modal**: \(p \approx 2.2 \times 10^{-13}\) - Significant; interaction with security modals is associated with enrollment status.
- **Network Failure**: \(p = 0.070\) - Not significant at the 0.05 level; the data do not show a clear association between network failures and enrollment.
- **Transaction List**: \(p \approx 1.3 \times 10^{-25}\) - Significant; using the transaction list feature is associated with enrollment status. The test does not show whether usage goes with higher or lower enrollment.
- **List Picker**: \(p = 0.00006\) - Significant; using the list picker feature is associated with enrollment.
- **Reward Detail**: \(p \approx 2.5 \times 10^{-6}\) - Significant; interaction with reward details is associated with enrollment.
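When several tests are read together, the p-values can be adjusted for multiple comparisons. The sketch below applies a Holm step-down adjustment to the five Chi-square p-values reported above; `holm_adjust` is a hypothetical helper written out by hand, and treating these five tests as one family is an assumption made for illustration.

```python
import numpy as np


def holm_adjust(pvalues: dict) -> dict:
    """Holm step-down adjusted p-values (hypothetical helper)."""
    names = list(pvalues)
    p = np.array([pvalues[name] for name in names], dtype=float)
    m = len(p)
    order = np.argsort(p)
    # The i-th smallest p-value (i = 0..m-1) is multiplied by (m - i)
    scaled = p[order] * (m - np.arange(m))
    # Keep adjusted values non-decreasing and cap them at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return dict(zip(names, adjusted.tolist()))


# p-values copied from the Chi-square output of the cell above
chi_square_results_multiple = {
    "SecurityModal": 2.1675154116171901e-13,
    "NetworkFailure": 0.06997705089803223,
    "TransactionList": 1.289329495300392e-25,
    "ListPicker": 6.128522405763001e-05,
    "RewardDetail": 2.486167321485173e-06,
}

adjusted = holm_adjust(chi_square_results_multiple)
still_significant = sorted(k for k, v in adjusted.items() if v < 0.05)
for name, value in adjusted.items():
    print(f"{name:16s} adjusted p = {value:.3g}")
print(still_significant)
```

Within this family of five, the same four features stay below 0.05 after adjustment and `NetworkFailure` stays above it. A correction across every test in the notebook would use a larger family and be stricter.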
