<a href="https://colab.research.google.com/github/cindyfu/MLinPractice/blob/main/10708_SplitDataset_baselines_Xuecong.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
#!wget https://dsapp-public-data-migrated.s3.us-west-2.amazonaws.com/donors_choose_2014_raw_CSVs.zip
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!unzip /content/drive/MyDrive/donors_choose_2014_raw_CSVs.zip

Archive:  /content/drive/MyDrive/donors_choose_2014_raw_CSVs.zip
  inflating: donations.csv.zip       
  inflating: essays.csv.zip          
  inflating: outcomes.csv.zip        
  inflating: projects.csv.zip        
  inflating: resources.csv.zip       
  inflating: sampleSubmission.csv.zip  


In [5]:
!ls -lh

total 926M
-rw-r--r-- 1 root root 253M Dec 11  2019 donations.csv.zip
drwx------ 6 root root 4.0K Sep 19 17:52 drive
-rw-r--r-- 1 root root 403M Dec 11  2019 essays.csv.zip
-rw-r--r-- 1 root root  12M Dec 11  2019 outcomes.csv.zip
-rw-r--r-- 1 root root  65M Dec 11  2019 projects.csv.zip
-rw-r--r-- 1 root root 194M Dec 11  2019 resources.csv.zip
drwxr-xr-x 1 root root 4.0K Sep 14 13:44 sample_data
-rw-r--r-- 1 root root 794K Dec 11  2019 sampleSubmission.csv.zip


In [6]:
from collections import defaultdict, Counter, namedtuple
import gzip, os, sys, time, pickle, itertools, copy, gc, importlib, resource, re, logging
from datetime import datetime, timedelta
from multiprocessing import Pool
from tqdm.auto import tqdm, trange
from pathlib import Path
from typing import List, Dict, Union

import numpy as np
from numpy.lib import recfunctions as rfn
import pandas as pd
from scipy.sparse import csr_matrix, csc_matrix, coo_matrix
from scipy.stats import ttest_ind, zscore
from scipy.cluster import hierarchy
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, pairwise_distances, accuracy_score, f1_score, precision_score, recall_score
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
from statsmodels.stats.multitest import multipletests

import torch
from torch import nn
from torch.optim import Adam
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset, TensorDataset

import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
plt.rcParams['pdf.fonttype'] = 42
plt.rcParams['svg.fonttype'] = 'none'

In [78]:
projects_df = pd.read_csv('projects.csv.zip')
donations_df = pd.read_csv('donations.csv.zip')
# resources_df = pd.read_csv('resources.csv.zip')

In [79]:
projects_df = projects_df.set_index('projectid')
projects_df['date_posted'] = pd.to_datetime(projects_df['date_posted'], format='%Y-%m-%d')
# drop fractional seconds before parsing the donation timestamps
donations_df['donation_timestamp'] = pd.to_datetime(donations_df['donation_timestamp'].str.split('.').str[0], format='%Y-%m-%d %H:%M:%S')
# -- data cleaning: remove projects that do not need funding (.copy() so later column assignments do not hit a slice)
projects_df = projects_df[projects_df['total_price_excluding_optional_support'] > 0].copy()
donations_df = donations_df[donations_df['projectid'].isin(projects_df.index)]
# -- data cleaning: keep only donations made less than 120 days (about 4 months) after the project is posted
donations_df = donations_df[donations_df['donation_timestamp'].values < (projects_df.loc[donations_df['projectid'], 'date_posted'] + timedelta(120)).values].copy()
# -- add features: calculate funded price and funded fraction for each project
projects_df['funded_price'] = donations_df.groupby('projectid')['donation_to_project'].sum()
# projects with no donation in the window get 0; assign back instead of a chained inplace fillna
projects_df['funded_price'] = projects_df['funded_price'].fillna(0.)
projects_df['funded_fraction'] = projects_df['funded_price'] / projects_df['total_price_excluding_optional_support']
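
The cleaning steps above can be checked on a toy example. This is a minimal sketch, assuming made-up project ids, dates, and amounts; the frames `projects` and `donations` are stand-ins for `projects_df` and `donations_df`, and `Series.map` is used here in place of the `.loc` lookup only to keep the example short.

```python
import pandas as pd

# Toy stand-ins for projects_df / donations_df (all values are invented).
projects = pd.DataFrame(
    {
        "date_posted": pd.to_datetime(["2012-01-01", "2012-03-01"]),
        "total_price_excluding_optional_support": [100.0, 200.0],
    },
    index=pd.Index(["p1", "p2"], name="projectid"),
)
donations = pd.DataFrame(
    {
        "projectid": ["p1", "p1", "p2"],
        "donation_timestamp": pd.to_datetime(["2012-01-15", "2012-06-01", "2012-03-02"]),
        "donation_to_project": [60.0, 40.0, 200.0],
    }
)

# Keep donations made less than 120 days after the project's posting date.
deadline = donations["projectid"].map(projects["date_posted"]) + pd.Timedelta(days=120)
donations = donations[donations["donation_timestamp"] < deadline]

# Sum the surviving donations per project; projects with none get 0.
funded = donations.groupby("projectid")["donation_to_project"].sum()
projects["funded_price"] = funded.reindex(projects.index, fill_value=0.0)
projects["funded_fraction"] = (
    projects["funded_price"] / projects["total_price_excluding_optional_support"]
)
print(projects[["funded_price", "funded_fraction"]])
```

In this toy data the second donation to `p1` arrives after the 120-day window and is dropped, so `p1` ends up 60% funded while `p2` is fully funded.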

In [80]:
essays_df = pd.read_csv('essays.csv.zip')
essays_df = essays_df.set_index('projectid')

In [38]:
essays_df

projectid,teacher_acctid,title,short_description,need_statement,essay
ffffc4f85b60efc5b52347df489d0238,c24011b20fc161ed02248e85beb59a90,iMath,It is imperative that teachers bring technolog...,My students need four iPods.,I am a fourth year fifth grade math teacher. T...
ffffac55ee02a49d1abc87ba6fc61135,947066d0af47e0566f334566553dd6a6,Recording Rockin' Readers,Can you imagine having to translate everything...,My students need a camcorder.,Can you imagine having to translate everything...
ffff97ed93720407d70a2787475932b0,462270f5d5c212162fcab11afa2623cb,Kindergarten In Need of Important Materials!,It takes a special person to donate to a group...,My students need 17 assorted classroom materia...,Hi. I teach a wonderful group of 4-5 year old ...
ffff7266778f71242675416e600b94e1,b9a8f14199e0d8109200ece179281f4f,Let's Find Out!,My Kindergarten students come from a variety o...,"My students need 25 copies of Scholastic's ""Le...",My Kindergarten students come from a variety o...
ffff418bb42fad24347527ad96100f81,e885fb002a1d0d39aaed9d21a7683549,Whistle While We Work!,"By using the cross curricular games requested,...",My students need grade level appropriate games...,All work and no play makes school a dull place...
...,...,...,...,...,...
0000ee613c92ddc5298bf63142996a5c,e0c0a0214d3c2cfdc0ab6639bc3c5342,Technology Upgrade A Must-Kindergartners Ready!,Kindergarten is an exciting time for learning ...,My students need an iPad mini to support instr...,Kindergarten is an exciting time for learning ...
0000b38bbc7252972f7984848cf58098,e1aa1ae5301d0cda860c4d9c89c24919,Visual Display Technology in the Classroom,My students have very limited exposure to tech...,My students need access to a projector in the ...,My students have very limited exposure to tech...
00002d691c05c51a5fdfbb2baef0ba25,7ad6abc974dd8b62773f79f6cbed48d5,You Go Read at HRS,"My students need high quality books, such as W...","My students need high quality books, such as W...",Our students need the challenge to read high q...
00002bff514104264a6b798356fdd893,3414541eb63108700b188648f866f483,Speedy Shark Reading Club,My students need more incentives to make them ...,My students need word building centers 20 At-Y...,My students need more incentives to make them ...


In [37]:
len(set(essays_df.index).intersection(projects_df.index))

663788

In [51]:
DatasetTuple = namedtuple('DatasetTuple', 'projects_df donations_df essays_df')

def subset_dataset(projects_df: pd.DataFrame, donations_df: pd.DataFrame, essays_df: pd.DataFrame, time_start: datetime, time_end: datetime) -> DatasetTuple:
    projects_df = projects_df[projects_df['date_posted'].between(time_start, time_end, inclusive='left')].copy()
    donations_df = donations_df[donations_df['projectid'].isin(projects_df.index)].copy()
    essays_df = essays_df[essays_df.index.isin(projects_df.index)].copy()
    return DatasetTuple(projects_df, donations_df, essays_df)

train_valid_pair1 = (
    subset_dataset(projects_df, donations_df, essays_df, datetime(2009, 1, 1), datetime(2010, 1, 1)),
    subset_dataset(projects_df, donations_df, essays_df, datetime(2011, 1, 1), datetime(2012, 1, 1)),
)
train_valid_pair2 = (
    subset_dataset(projects_df, donations_df, essays_df, datetime(2010, 1, 1), datetime(2011, 1, 1)),
    subset_dataset(projects_df, donations_df, essays_df, datetime(2012, 1, 1), datetime(2013, 1, 1)),
)
train_valid_pair3 = (
    subset_dataset(projects_df, donations_df, essays_df, datetime(2011, 1, 1), datetime(2012, 1, 1)),
    subset_dataset(projects_df, donations_df, essays_df, datetime(2013, 1, 1), datetime(2014, 1, 1)),
)
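
Each window in `subset_dataset` is half-open: `inclusive='left'` keeps projects posted on `time_start` and excludes those posted on `time_end`, so adjacent yearly windows do not overlap. A small sketch of that boundary behavior, assuming three invented posting dates (the `inclusive='left'` string form needs pandas 1.3 or later):

```python
from datetime import datetime

import pandas as pd

# Invented posting dates: first day of the window, last day, and the end boundary.
posted = pd.Series(
    pd.to_datetime(["2009-01-01", "2009-12-31", "2010-01-01"]),
    index=["first_day", "last_day", "end_boundary"],
)

# Same call shape as in subset_dataset: start included, end excluded.
in_window = posted.between(datetime(2009, 1, 1), datetime(2010, 1, 1), inclusive="left")
print(in_window.to_dict())
```

Only the project posted exactly on the end boundary falls outside the window.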

In [52]:
for train_tuple, valid_tuple in [train_valid_pair1, train_valid_pair2, train_valid_pair3]:
    print(train_tuple.projects_df.shape, train_tuple.donations_df.shape, valid_tuple.projects_df.shape, valid_tuple.donations_df.shape)

(63547, 36) (188496, 21) (104196, 36) (623037, 21)
(86448, 36) (452323, 21) (117626, 36) (619055, 21)
(104196, 36) (623037, 21) (131329, 36) (799986, 21)


In [76]:
# three baselines:
from sklearn.metrics import classification_report

def get_baselines(dataset_tuple, a=40):
    projects_df, donations_df, essays_df = dataset_tuple
    # compute essay length as a separate Series so the caller's essays_df is not modified;
    # join() returns a new frame, so the columns added below stay local to this function
    essay_length = essays_df['essay'].str.len().rename('essay_length')
    projects_df = projects_df.join(essay_length)
    # positive class (1): the project was not fully funded
    ground_truth = (projects_df.funded_fraction < 1).astype(int)
    N = projects_df.shape[0]

    print("base rate: ", np.mean(ground_truth))

    # baseline 1: flag a% of projects, taking the lowest poverty levels first and
    # sampling at random within the level where the a% quota runs out
    baseline_num = int(a*N/100)
    projects_df['tp_bl1'] = np.zeros(N, dtype="int")
    if sum(projects_df.poverty_level.isin(["low poverty"])) >= baseline_num:
        tp_pov = projects_df[projects_df.poverty_level == "low poverty"]
        projects_df.loc[np.random.choice(tp_pov.index, size=baseline_num, replace=False), 'tp_bl1'] = 1
    elif sum(projects_df.poverty_level.isin(["low poverty", "moderate poverty"])) >= baseline_num:
        tp_pov1 = projects_df[projects_df.poverty_level == "low poverty"]
        projects_df.loc[tp_pov1.index, 'tp_bl1'] = 1
        tp_pov = projects_df[projects_df.poverty_level == "moderate poverty"]
        projects_df.loc[np.random.choice(tp_pov.index, size=baseline_num - len(tp_pov1), replace=False), 'tp_bl1'] = 1
    elif sum(projects_df.poverty_level.isin(["low poverty", "moderate poverty", "high poverty"])) >= baseline_num:
        tp_pov1 = projects_df[projects_df.poverty_level.isin(["low poverty", "moderate poverty"])]
        projects_df.loc[tp_pov1.index, 'tp_bl1'] = 1
        tp_pov = projects_df[projects_df.poverty_level == "high poverty"]
        projects_df.loc[np.random.choice(tp_pov.index, size=baseline_num - len(tp_pov1), replace=False), 'tp_bl1'] = 1
    elif sum(projects_df.poverty_level.isin(["low poverty", "moderate poverty", "high poverty", "highest poverty"])) >= baseline_num:
        tp_pov1 = projects_df[projects_df.poverty_level.isin(["low poverty", "moderate poverty", "high poverty"])]
        projects_df.loc[tp_pov1.index, 'tp_bl1'] = 1
        tp_pov = projects_df[projects_df.poverty_level == "highest poverty"]
        projects_df.loc[np.random.choice(tp_pov.index, size=baseline_num - len(tp_pov1), replace=False), 'tp_bl1'] = 1

    # baseline 2: top a% based on total asking price
    projects_df['tp_bl2'] = np.zeros(N, dtype="int")
    projects_df.loc[projects_df.nlargest(baseline_num, "total_price_excluding_optional_support").index, 'tp_bl2'] = 1

    # baseline 3: shortest a% essays
    projects_df['tp_bl3'] = np.zeros(N, dtype="int")
    projects_df.loc[projects_df.nsmallest(baseline_num, "essay_length").index, 'tp_bl3'] = 1

    print("randomly choose {0}% of low/moderate poverty".format(a))
    print(classification_report(y_true=ground_truth,
                        y_pred=np.array(projects_df['tp_bl1']), labels=[1],digits=4))

    print("highest {0}% total_price_excluding_optional_support benchmark:".format(a))
    print(classification_report(y_true=ground_truth,
                        y_pred=np.array(projects_df['tp_bl2']), labels=[1],digits=4))

    print("shortest {0}% essay benchmark:".format(a))
    print(classification_report(y_true=ground_truth,
                        y_pred=np.array(projects_df['tp_bl3']), labels=[1],digits=4))
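
Each baseline flags a fixed a% of projects and is then scored on the positive class only. The sketch below illustrates that "flag the top a%, then measure precision and recall" pattern on invented data; `flag_top_fraction` is a hypothetical helper, not part of the notebook, and the prices and labels are made up.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score


def flag_top_fraction(scores: pd.Series, a: float) -> pd.Series:
    """Hypothetical helper: return 0/1 flags marking the a% highest scores."""
    k = int(a * len(scores) / 100)
    flags = pd.Series(0, index=scores.index)
    flags.loc[scores.nlargest(k).index] = 1
    return flags


# Invented asking prices and labels (1 = not fully funded) for ten projects.
asking_price = pd.Series(
    [900.0, 100.0, 450.0, 300.0, 800.0, 50.0, 700.0, 200.0, 650.0, 150.0],
    index=list("abcdefghij"),
)
not_fully_funded = pd.Series([1, 0, 0, 0, 0, 0, 1, 0, 1, 1], index=asking_price.index)

# Flag the 20% most expensive projects and score the flags against the labels.
flags = flag_top_fraction(asking_price, a=20)
precision = precision_score(not_fully_funded, flags)
recall = recall_score(not_fully_funded, flags)
print(precision, recall)
```

Because only a% of projects are flagged, recall is bounded by roughly a% divided by the base rate. With a=10 and a base rate near 0.44, that ceiling is about 0.23, which is why precision is the more informative column when comparing these baselines.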

In [77]:
for i, (train_tuple, valid_tuple) in enumerate([train_valid_pair1, train_valid_pair2, train_valid_pair3]):
  print("train validation pair " + str(i))
  print("training set")
  get_baselines(train_tuple, a=10)
  print("validation set")
  get_baselines(valid_tuple, a=10)

train validation pair 0
training set
base rate:  0.4418619289659622
randomly choose 10% of low/moderate poverty



              precision    recall  f1-score   support

           1     0.5574    0.1261    0.2057     28079

   micro avg     0.5574    0.1261    0.2057     28079
   macro avg     0.5574    0.1261    0.2057     28079
weighted avg     0.5574    0.1261    0.2057     28079

highest 10% total_price_excluding_optional_support benchmark:
              precision    recall  f1-score   support

           1     0.5519    0.1249    0.2037     28079

   micro avg     0.5519    0.1249    0.2037     28079
   macro avg     0.5519    0.1249    0.2037     28079
weighted avg     0.5519    0.1249    0.2037     28079

shortest 10% essay benchmark:
              precision    recall  f1-score   support

           1     0.4742    0.1073    0.1750     28079

   micro avg     0.4742    0.1073    0.1750     28079
   macro avg     0.4742    0.1073    0.1750     28079
weighted avg     0.4742    0.1073    0.1750     28079

validation set
base rate:  0.4367346155322661



randomly choose 10% of low/moderate poverty
              precision    recall  f1-score   support

           1     0.4547    0.1041    0.1694     45506

   micro avg     0.4547    0.1041    0.1694     45506
   macro avg     0.4547    0.1041    0.1694     45506
weighted avg     0.4547    0.1041    0.1694     45506

highest 10% total_price_excluding_optional_support benchmark:
              precision    recall  f1-score   support

           1     0.5947    0.1362    0.2216     45506

   micro avg     0.5947    0.1362    0.2216     45506
   macro avg     0.5947    0.1362    0.2216     45506
weighted avg     0.5947    0.1362    0.2216     45506

shortest 10% essay benchmark:
              precision    recall  f1-score   support

           1     0.4378    0.1002    0.1631     45506

   micro avg     0.4378    0.1002    0.1631     45506
   macro avg     0.4378    0.1002    0.1631     45506
weighted avg     0.4378    0.1002    0.1631     45506

train validation pair 1
training set
base rate

randomly choose 10% of low/moderate poverty
              precision    recall  f1-score   support

           1     0.4727    0.1209    0.1926     33784

   micro avg     0.4727    0.1209    0.1926     33784
   macro avg     0.4727    0.1209    0.1926     33784
weighted avg     0.4727    0.1209    0.1926     33784

highest 10% total_price_excluding_optional_support benchmark:
              precision    recall  f1-score   support

           1     0.5464    0.1398    0.2226     33784

   micro avg     0.5464    0.1398    0.2226     33784
   macro avg     0.5464    0.1398    0.2226     33784
weighted avg     0.5464    0.1398    0.2226     33784

shortest 10% essay benchmark:
              precision    recall  f1-score   support

           1     0.3965    0.1014    0.1615     33784

   micro avg     0.3965    0.1014    0.1615     33784
   macro avg     0.3965    0.1014    0.1615     33784
weighted avg     0.3965    0.1014    0.1615     33784

validation set
base rate:  0.3754952136432421



randomly choose 10% of low/moderate poverty
              precision    recall  f1-score   support

           1     0.3925    0.1045    0.1651     44168

   micro avg     0.3925    0.1045    0.1651     44168
   macro avg     0.3925    0.1045    0.1651     44168
weighted avg     0.3925    0.1045    0.1651     44168

highest 10% total_price_excluding_optional_support benchmark:
              precision    recall  f1-score   support

           1     0.4901    0.1305    0.2061     44168

   micro avg     0.4901    0.1305    0.2061     44168
   macro avg     0.4901    0.1305    0.2061     44168
weighted avg     0.4901    0.1305    0.2061     44168

shortest 10% essay benchmark:
              precision    recall  f1-score   support

           1     0.3879    0.1033    0.1631     44168

   micro avg     0.3879    0.1033    0.1631     44168
   macro avg     0.3879    0.1033    0.1631     44168
weighted avg     0.3879    0.1033    0.1631     44168

train validation pair 2
training set
base rate

randomly choose 10% of low/moderate poverty
              precision    recall  f1-score   support

           1     0.4594    0.1052    0.1712     45506

   micro avg     0.4594    0.1052    0.1712     45506
   macro avg     0.4594    0.1052    0.1712     45506
weighted avg     0.4594    0.1052    0.1712     45506

highest 10% total_price_excluding_optional_support benchmark:
              precision    recall  f1-score   support

           1     0.5947    0.1362    0.2216     45506

   micro avg     0.5947    0.1362    0.2216     45506
   macro avg     0.5947    0.1362    0.2216     45506
weighted avg     0.5947    0.1362    0.2216     45506

shortest 10% essay benchmark:
              precision    recall  f1-score   support

           1     0.4378    0.1002    0.1631     45506

   micro avg     0.4378    0.1002    0.1631     45506
   macro avg     0.4378    0.1002    0.1631     45506
weighted avg     0.4378    0.1002    0.1631     45506

validation set
base rate:  0.3371304129324064



randomly choose 10% of low/moderate poverty
              precision    recall  f1-score   support

           1     0.3513    0.1042    0.1607     44275

   micro avg     0.3513    0.1042    0.1607     44275
   macro avg     0.3513    0.1042    0.1607     44275
weighted avg     0.3513    0.1042    0.1607     44275

highest 10% total_price_excluding_optional_support benchmark:
              precision    recall  f1-score   support

           1     0.4599    0.1364    0.2104     44275

   micro avg     0.4599    0.1364    0.2104     44275
   macro avg     0.4599    0.1364    0.2104     44275
weighted avg     0.4599    0.1364    0.2104     44275

shortest 10% essay benchmark:
              precision    recall  f1-score   support

           1     0.3313    0.0983    0.1516     44275

   micro avg     0.3313    0.0983    0.1516     44275
   macro avg     0.3313    0.0983    0.1516     44275
weighted avg     0.3313    0.0983    0.1516     44275



In [81]:
get_baselines((projects_df, donations_df, essays_df),10)

base rate:  0.4521895544963151



randomly choose 10% of low/moderate poverty
              precision    recall  f1-score   support

           1     0.4914    0.1087    0.1780    300158

   micro avg     0.4914    0.1087    0.1780    300158
   macro avg     0.4914    0.1087    0.1780    300158
weighted avg     0.4914    0.1087    0.1780    300158

highest 10% total_price_excluding_optional_support benchmark:
              precision    recall  f1-score   support

           1     0.6336    0.1401    0.2295    300158

   micro avg     0.6336    0.1401    0.2295    300158
   macro avg     0.6336    0.1401    0.2295    300158
weighted avg     0.6336    0.1401    0.2295    300158

shortest 10% essay benchmark:
              precision    recall  f1-score   support

           1     0.4656    0.1030    0.1686    300158

   micro avg     0.4656    0.1030    0.1686    300158
   macro avg     0.4656    0.1030    0.1686    300158
weighted avg     0.4656    0.1030    0.1686    300158

