Functions
----------

We expose 7 functions, each of which takes either a pandas DataFrame or a CSV. If the CSV doesn't have a header,
we make some assumptions about where the data is.
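
When a CSV has no header row, columns can only be referred to by position. As a minimal sketch of what that looks like on the pandas side (the sample data is made up for illustration, and how ethnicolr itself resolves positions is not shown here), reading with ``header=None`` labels the columns 0, 1, and so on:

```python
import io

import pandas as pd

# Hypothetical headerless CSV: an id in column 0, a last name in column 1.
raw = io.StringIO("1,smith\n2,cai\n3,jackson\n")

# header=None tells pandas not to treat the first row as column names;
# the columns are labelled by integer position instead.
df = pd.read_csv(raw, header=None)

last_name_col = 1  # location of the last-name column
print(df[last_name_col].tolist())
```

The integer label can then be used wherever a column name would otherwise be given.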

-  **census\_ln**

   -  Input: pandas DataFrame or CSV and a string or list of the name or
      location of the column containing the last name.

   -  What it does:

      -  Removes extra space.
      -  For names in the `census file <ethnicolr/data/census>`__, it appends relevant data.

   -  Options:

      -  year: 2000 or 2010. If no year is given, data from the 2000 census
         is appended.

   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic. See 
      `here <https://github.com/appeler/ethnicolr/blob/master/ethnicolr/data/census/census_2000.pdf>`__ for what the column names mean.

-  **pred\_census\_ln**

   -  Input: pandas DataFrame or CSV and string or list of the name or
      location of the column containing the last name.

   -  What it does:

      -  Removes extra space.
      -  Uses the `last name census 2000
         model <ethnicolr/models/ethnicolr_keras_lstm_census2000_ln.ipynb>`__
         or `last name census 2010
         model <ethnicolr/models/ethnicolr_keras_lstm_census2010_ln.ipynb>`__
         to predict the race and ethnicity.

   -  Options:

      -  year: 2000 or 2010

   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (white, black, api, or hispanic), api (probability that the name is
      Asian or Pacific Islander), black, hispanic, white.

-  **pred\_wiki\_ln**

   -  Input: pandas DataFrame or CSV and string or list of the name or
      location of the column containing the last name.

   -  What it does:

      -  Removes extra space.
      -  Uses the `last name wiki model <ethnicolr/models/ethnicolr_keras_lstm_wiki_ln.ipynb>`__
         to predict the race and ethnicity.

   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (categorical variable: the category with the highest probability),
      followed by one probability column per category:
      "Asian,GreaterEastAsian,EastAsian", "Asian,GreaterEastAsian,Japanese",
      "Asian,IndianSubContinent", "GreaterAfrican,Africans", "GreaterAfrican,Muslim",
      "GreaterEuropean,British", "GreaterEuropean,EastEuropean",
      "GreaterEuropean,Jewish", "GreaterEuropean,WestEuropean,French",
      "GreaterEuropean,WestEuropean,Germanic", "GreaterEuropean,WestEuropean,Hispanic",
      "GreaterEuropean,WestEuropean,Italian", "GreaterEuropean,WestEuropean,Nordic"

-  **pred\_wiki\_name**

   -  Input: pandas DataFrame or CSV and string or list containing the name or
      location of the columns containing the first name, last name, middle
      name, and suffix, if present. The first name and last name columns are
      required. If there are no middle name or suffix columns, it is
      assumed that there are no middle names or suffixes.

   -  What it does:

      -  Removes extra space.
      -  Uses the `full name wiki
         model <ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb>`__ to predict the
         race and ethnicity.

   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (categorical variable: the category with the highest probability),
      followed by one probability column per category:
      "Asian,GreaterEastAsian,EastAsian", "Asian,GreaterEastAsian,Japanese",
      "Asian,IndianSubContinent", "GreaterAfrican,Africans", "GreaterAfrican,Muslim",
      "GreaterEuropean,British", "GreaterEuropean,EastEuropean",
      "GreaterEuropean,Jewish", "GreaterEuropean,WestEuropean,French",
      "GreaterEuropean,WestEuropean,Germanic", "GreaterEuropean,WestEuropean,Hispanic",
      "GreaterEuropean,WestEuropean,Italian", "GreaterEuropean,WestEuropean,Nordic"

-  **pred\_fl\_reg\_ln**

   -  Input: pandas DataFrame or CSV and string or list of the name or location
      of the column containing the last name.

   -  What it does:

      -  Removes extra space.
      -  Uses the `last name FL registration
         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln.ipynb>`__ to predict the race
         and ethnicity.

   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (asian, hispanic, nh_black, or nh_white), asian (probability that
      the name is Asian), hispanic, nh_black, nh_white.

-  **pred\_fl\_reg\_name**

   -  Input: pandas DataFrame or CSV and string or list containing the name or
      location of the columns containing the first name, last name, middle
      name, and suffix, if present. The first name and last name columns are
      required. If there are no middle name or suffix columns, it is
      assumed that there are no middle names or suffixes.

   -  What it does:

      -  Removes extra space.
      -  Uses the `full name FL
         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_name.ipynb>`__ to predict the
         race and ethnicity.

   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (asian, hispanic, nh_black, or nh_white), asian (probability that
      the name is Asian), hispanic, nh_black, nh_white.

-  **pred\_nc\_reg\_name**

   -  Input: pandas DataFrame or CSV and string or list containing the name or
      location of the columns containing the first name, last name, middle
      name, and suffix, if present. The first name and last name columns are
      required. If there are no middle name or suffix columns, it is
      assumed that there are no middle names or suffixes.

   -  What it does:

      -  Removes extra space.
      -  Uses the `full name NC
         model <ethnicolr/models/ethnicolr_keras_lstm_nc_12_cat_model.ipynb>`__ to predict the
         race and ethnicity.

   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race + ethnicity. The codebook is `here <https://github.com/appeler/nc_race_ethnicity>`__.
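
The two steps that ``census_ln`` is described as performing (removing extra space, then appending data for names found in the census file) can be sketched with plain pandas. This is only an illustration under stated assumptions, not the package's implementation: the lookup table is a toy stand-in with values taken from the ``census_ln`` example output in this document, and the exact normalization rule is assumed.

```python
import pandas as pd

# Toy stand-in for the census lookup table (the real file has many more
# names and columns).
census = pd.DataFrame({
    "name": ["smith", "jackson"],
    "pctwhite": [73.35, 41.93],
    "pctblack": [22.22, 53.02],
})

df = pd.DataFrame({"name": ["  smith ", "jackson", "notaname"]})

# Assumed normalization: trim the ends and collapse runs of whitespace.
df["name"] = df["name"].str.strip().str.replace(r"\s+", " ", regex=True)

# A left join keeps every input row; names missing from the lookup get NaN.
out = df.merge(census, on="name", how="left")
print(out)
```

A left join is used so that rows without a match stay in the result with missing values rather than being dropped.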

In [1]:
import pandas as pd
from ethnicolr import census_ln, pred_census_ln, pred_wiki_ln, pred_wiki_name, pred_nc_reg_name, pred_fl_reg_ln

In [2]:
names = [
         {'name': 'smith'},
         {'name': 'cai'},
         {'name': 'jackson'}
        ]
df = pd.DataFrame(names)

In [3]:
census_ln(df, 'name', 2000)

Unnamed: 0,name,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
0,smith,73.35,22.22,0.4,0.85,1.63,1.56
1,cai,1.56,0.75,95.96,(S),1.26,(S)
2,jackson,41.93,53.02,0.31,1.04,2.18,1.53


In [4]:
census_ln(df, 'name', 2010)

Unnamed: 0,name,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
0,smith,70.9,23.11,0.5,0.89,2.19,2.4
1,cai,1.54,(S),95.83,(S),1.08,0.91
2,jackson,39.89,53.04,0.39,1.06,3.12,2.5
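
Some cells in the output above hold ``(S)`` rather than a number. In the Census surname files this appears to mark values suppressed for confidentiality (an assumption worth checking against the codebook linked above). If the percentage columns come back as strings for that reason, one possible way to make them numeric is sketched below; the small frame is rebuilt by hand from two rows of the 2010 output.

```python
import pandas as pd

# Two rows from the 2010 output above; "(S)" forces the column to strings.
out = pd.DataFrame({
    "name": ["smith", "cai"],
    "pctblack": ["23.11", "(S)"],
    "pctapi": ["0.5", "95.83"],
})

pct_cols = ["pctblack", "pctapi"]
# errors="coerce" turns anything non-numeric, such as "(S)", into NaN.
out[pct_cols] = out[pct_cols].apply(pd.to_numeric, errors="coerce")
print(out.dtypes)
```

Whether a suppressed value should stay missing or be imputed depends on the analysis; the sketch leaves it as NaN.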


In [5]:
pred_census_ln(df, 'name', 2000)



Unnamed: 0,name,race,api,black,hispanic,white
0,smith,white,0.002019,0.247235,0.014485,0.73626
1,cai,api,0.911435,0.004869,0.025236,0.05846
2,jackson,black,0.002797,0.528193,0.014605,0.454405
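
The ``race`` column is the category with the highest predicted probability. A minimal sketch of that relationship, using the probabilities copied from the output above (this reproduces the column from the probabilities and is not the package's own code):

```python
import pandas as pd

# Probabilities copied from the pred_census_ln(df, 'name', 2000) output above.
probs = pd.DataFrame(
    {
        "api": [0.002019, 0.911435, 0.002797],
        "black": [0.247235, 0.004869, 0.528193],
        "hispanic": [0.014485, 0.025236, 0.014605],
        "white": [0.73626, 0.05846, 0.454405],
    },
    index=["smith", "cai", "jackson"],
)

# idxmax(axis=1) returns, for each row, the column label with the largest value.
race = probs.idxmax(axis=1)
print(race.tolist())  # ['white', 'api', 'black']
```

Note that the winning probability for ``jackson`` is only about 0.53, so the probability columns carry information that the single ``race`` label does not.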


In [6]:
pred_census_ln(df, 'name', 2010)

Unnamed: 0,name,race,api,black,hispanic,white
0,smith,white,0.002019,0.247235,0.014485,0.73626
1,cai,api,0.911435,0.004869,0.025236,0.05846
2,jackson,black,0.002797,0.528193,0.014605,0.454405


In [7]:
pred_wiki_name(df, 'name', 'name')



Unnamed: 0,name,race,"Asian,GreaterEastAsian,EastAsian","Asian,GreaterEastAsian,Japanese","Asian,IndianSubContinent","GreaterAfrican,Africans","GreaterAfrican,Muslim","GreaterEuropean,British","GreaterEuropean,EastEuropean","GreaterEuropean,Jewish","GreaterEuropean,WestEuropean,French","GreaterEuropean,WestEuropean,Germanic","GreaterEuropean,WestEuropean,Hispanic","GreaterEuropean,WestEuropean,Italian","GreaterEuropean,WestEuropean,Nordic"
0,smith,"GreaterEuropean,British",0.003173,0.001601,0.002712,0.001695,0.000385,0.93842,0.00041,0.017481,0.00948,0.002295,0.008037,0.006935,0.007374
1,cai,"Asian,GreaterEastAsian,EastAsian",0.60247,0.002524,0.010841,0.022514,0.034715,0.034711,0.011194,0.024797,0.187769,0.007142,0.008134,0.046725,0.006463
2,jackson,"GreaterEuropean,British",0.015043,0.017981,0.003017,0.005963,0.000315,0.840078,0.001387,0.042143,0.004479,0.003223,0.035877,0.004686,0.025809


In [8]:
pred_wiki_ln(df, 'name')



Unnamed: 0,name,race,"Asian,GreaterEastAsian,EastAsian","Asian,GreaterEastAsian,Japanese","Asian,IndianSubContinent","GreaterAfrican,Africans","GreaterAfrican,Muslim","GreaterEuropean,British","GreaterEuropean,EastEuropean","GreaterEuropean,Jewish","GreaterEuropean,WestEuropean,French","GreaterEuropean,WestEuropean,Germanic","GreaterEuropean,WestEuropean,Hispanic","GreaterEuropean,WestEuropean,Italian","GreaterEuropean,WestEuropean,Nordic"
0,smith,"GreaterEuropean,British",0.002573,0.004109,0.002233,0.005376,0.002803,0.930602,0.000945,0.015395,0.015284,0.001949,0.012589,0.003486,0.002656
1,cai,"Asian,GreaterEastAsian,EastAsian",0.631215,0.019463,0.058025,0.0136,0.048377,0.033724,0.011785,0.024261,0.047024,0.003701,0.016103,0.08952,0.003201
2,jackson,"GreaterEuropean,British",0.002014,0.001393,0.001211,0.001629,0.000349,0.946337,0.000304,0.006481,0.016904,0.000856,0.01854,0.001492,0.00249


In [9]:
pred_nc_reg_name(df, 'name', 'name')



Unnamed: 0,name,race,HL+A,HL+B,HL+I,HL+M,HL+O,HL+W,NL+A,NL+B,NL+I,NL+M,NL+O,NL+W
0,smith,NL+O,0.0001299761,6.06964e-06,1.718623e-10,0.00106302,0.002065,0.02538,0.016214,0.265759,0.21488,0.002355,0.31552,0.156628
1,cai,NL+A,1.994796e-10,7.564134e-14,1.354651e-11,1.474962e-08,0.000149,4e-06,0.987327,0.000164,1.1e-05,1.2e-05,0.011661,0.000673
2,jackson,HL+W,8.784482e-05,0.002971419,2.314539e-08,0.002956621,0.049923,0.4453,0.075231,0.134984,0.005478,0.031609,0.038749,0.212711
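
Each label in the output above combines an ethnicity code and a race code joined by ``+``. The sketch below splits a label into its two parts. The helper ``split_label`` is hypothetical, and the code meanings in the two dictionaries are assumptions that should be checked against the codebook before use.

```python
# Assumed meanings of the codes; check them against the codebook before use.
ETHNICITY = {"HL": "Hispanic or Latino", "NL": "Not Hispanic or Latino"}
RACE = {
    "A": "Asian",
    "B": "Black or African American",
    "I": "American Indian or Alaska Native",
    "M": "Two or more races",
    "O": "Other",
    "W": "White",
}


def split_label(label):
    """Split a label such as 'NL+A' into (ethnicity code, race code)."""
    ethnicity, race = label.split("+")
    return ethnicity, race


for label in ["NL+O", "NL+A", "HL+W"]:
    eth, race = split_label(label)
    print(label, "->", ETHNICITY[eth], "/", RACE[race])
```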


In [10]:
pred_fl_reg_ln(df, 'name')



Unnamed: 0,name,race,asian,hispanic,nh_black,nh_white
0,smith,nh_white,0.006023,0.01148,0.22286,0.759638
1,cai,asian,0.801705,0.0127,0.059532,0.126064
2,jackson,nh_black,0.004752,0.006477,0.548833,0.439938
