# **Spoken Language Diversity Across Washington Counties**
#### *Arona Cho (aronacho@uw.edu), Arpan Kapoor (akap1204@uw.edu)*
###### *CSE 163 - Intermediate Data Programming | Data Science Fair Project Report & Code | University of Washington*

***
## **Summary** 
***

* ## Research Question 1: How does language diversity in each Washington county affect their educational attainment outcomes?
##### &emsp;Language diversity did not have as large an effect on educational attainment across Washington counties as we expected. In fact, for certain non-English-speaking groups, there appears to be an inverse relationship: counties with more speakers of those languages see lower rates of educational attainment. For each language group, the results show either this inverse pattern or only a very slight positive correlation between language diversity and educational attainment.

* ## Research Question 2: How do school district budgets and spending affect spoken language diversity in WA counties?
##### &emsp;From our analysis, school district budgets and spending do not seem to have a strong effect on spoken language diversity in WA counties, and spoken language diversity does not seem to have a strong effect on budgets and spending either.

* ## Research Question 3: To what extent do Washington residents continue to practice their ancestral culture through language?
##### &emsp;Generally, we found no consistent relationship between how ancestrally diverse a county's residents are and how much their ancestral languages are spoken. Rather, in counties with more ancestrally diverse populations, the rate of residents who speak only English is higher as well.



***
## **Motivation**
***
##### &emsp;Spoken language diversity in our communities directly affects the extent to which various cultures can communicate and interact with each other. Identifying why certain Washington counties have a wider range of spoken languages can help expand language diversity in counties that lack it, fostering closer, more personal communities. We look at two vital areas of community life in Washington state: academic settings and family households. Local governments can use these conclusions to increase efforts and redirect resources toward increasing spoken language diversity in their respective territories.

***
## **Data Setting**
***
##### &emsp; For this project, we retrieved the majority of our datasets from the United States Census, focusing on data for all 39 Washington counties from the most recent year recorded; most of our data comes from 2022. We used a total of five datasets, four of which came from the US Census. The fifth, containing per pupil expenditure (PPE) values for all school districts in Washington, was retrieved from the official Washington State Open Data Portal. For this dataset, we used the data.wa.gov website's built-in aggregation tools to filter out data points that were not relevant to our project. This solved the initial issue of the dataset's file size being too large and made the data easier to work with once imported. After aggregation, we were left with school district names and their PPEs. We merged this with the US Census dataset of school districts and their associated counties on matching district names to find which county each district is in, which allowed us to compute the average PPE for each Washington county. Merging the datasets with each other also allowed us to examine county data in a more granular manner and to create plots showing trends between the shared data values.
##### &emsp; As the majority of this data is taken from the US Census Bureau, the population it covers is all United States residents, or in our case all Washington State residents, at the time of collection. Much of the data is self-reported by respondents, so it is prone to response bias. In particular, participants may succumb to social desirability, which can skew the data toward more socially acceptable answers. Despite this, we believe that some solid takeaways can be drawn from the resulting graphs and data plots, and that they could be further supported or investigated by future research and analysis in this area.
##### The titles, links, and sources for each dataset used in this research project are listed below:
* ##### School Enrollment (S1401) - US Census: https://data.census.gov/table/ACSST5Y2022.S1401?q=s1401&g=050XX00US53001
* ##### Language Spoken at Home (S1601) - US Census: https://data.census.gov/table/ACSST5Y2022.S1601?q=s1601&g=050XX00US53001
* ##### Selected Social Characteristics in the United States (DP02) - US Census: https://data.census.gov/table/ACSDP5Y2022.DP02?q=Dp02&g=050XX00US53001
* ##### School Districts and Associated Counties - US Census: https://www.census.gov/programs-surveys/saipe/guidance-geographies/districts-counties.html
* ##### Per Pupil Expenditures All Years - Washington State Open Data Portal: https://data.wa.gov/education/Per-Pupil-Expenditure_AllYears/vnm3-j8pe/about_data
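
##### &emsp; The district-to-county averaging described in this section can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the district names, county names, PPE values, and column names (`district`, `county`, `ppe`) are all hypothetical placeholders.

```python
import pandas as pd

# Hypothetical district-level PPE values (not taken from the real dataset).
district_ppe = pd.DataFrame({
    "district": ["District A", "District B", "District C"],
    "ppe": [12000.0, 14000.0, 15000.0],
})

# Hypothetical district-to-county lookup, standing in for the Census file.
district_county = pd.DataFrame({
    "district": ["District A", "District B", "District C"],
    "county": ["County X", "County X", "County Y"],
})

# Attach a county to every district, then average PPE within each county.
merged = district_ppe.merge(district_county, on="district", how="inner")
county_ppe = merged.groupby("county")["ppe"].mean()

print(county_ppe)
```

##### &emsp; An inner merge silently drops districts whose names do not match in both tables, which is one reason consistent district names matter before this step.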

***
## **Method**
***
##### 1. Retrieve the four datasets listed in the Data Setting section containing values for all 39 Washington counties from the US Census website.
##### 2. Retrieve the dataset containing the per pupil expenditure information for all school districts in Washington from the data.wa.gov website.
##### 3. Use a GitHub repository to upload all collected datasets in order to store them in an organized, shareable manner.
##### 4. Use VS Code with the Live Share, GitHub Codespaces, and Jupyter extensions in order to collaboratively work on the code and datasets for this project.
##### 5. Import the datasets from the repository using GitHub Codespaces and create dataframes for each of them.
##### 6. Clean each dataset, removing any unneeded data values and reformatting any irregular indexes, columns/column names, and data values.
##### 7. Merge datasets to have the necessary information within the same dataframe, ready to be plotted.
##### 8. Use Plotly to create visuals in order to identify trends between the accumulated dataframes.
##### 9. If a significant correlation is suspected from the initial visualizations, perform a regression analysis to verify or disprove the suspected hypothesis.
##### 10. Document trends/correlations found from Plotly visualizations and any other conclusions found.
##### 11. Compile findings in a presentable format using Jupyter coding and markdown cells.


##### &emsp; First, to gauge the diversity of languages at a smaller scale, we chose to group all non-English-speaking households together and compare them against the exclusively English-speaking households. We were intrigued by the variability of the households that spoke English only and thought that the English Only households provided a range wide enough to potentially show a relationship to ancestral background when used as the predicted variable. 
##### &emsp; The scatterplot shows the rates of English Only speaking households, as determined above, by ancestral diversity. We can clearly see a positive and fairly strong linear relationship, indicating that there is some correlation between these two variables. The ordinary least squares table shows that the R-squared value is 0.825, meaning that 82.5% of the variability in the English Only households can be attributed to ancestral diversity. Also, because the reported p-value (0.000) is below the 5% significance level, we can reject the notion that English Only households have no correlation with ancestral diversity. Before running these analyses, we expected an inverse relationship, as we assumed that more ancestral diversity was an indicator of more non-English languages being used. We were surprised to find the opposite and to conclude that Washington residents continue to practice their ancestral culture through language only to a low extent. Viewed through a communication lens, immigrant families many times purposefully choose not to teach their children their native tongue in hopes of faster assimilation through picking up English and becoming Americanized. Another reason for this positive correlation could be that when counties bring together people with diverse ancestral backgrounds, English may have become the default as the most universal and accessible language for communicating with one another.

In [None]:
import pandas as pd
import plotly.express as px

# Reshape diverse_lang into long format: two rows per county,
# one for each language group
lang_counties = []
lang_counts = []
lang_kind = []

for idx, row in diverse_lang.iterrows():
    lang_counties.append(idx)
    lang_counts.append(row['english_only'])
    lang_kind.append('English Only')
    lang_counties.append(idx)
    lang_counts.append(row['language_other_than_english'])
    lang_kind.append('Language other than English')

lang_data = {}
lang_data['County'] = lang_counties
lang_data['Percent'] = lang_counts
lang_data['Language'] = lang_kind
lang_data = pd.DataFrame(lang_data).sort_values(by='Percent')

# Horizontal bar chart of both language groups for every county
fig = px.bar(lang_data, x="Percent", y="County", color="Language", title="Diversity in Language in WA Counties", height=900)
fig.show()

# Test the strength of the relationship between the two variables:
# ancestral diversity in each county and English only being spoken in households
fig = px.scatter(diverse_lang, x='diversity_ratio', y='english_only', trendline='ols',
                 title='English Only Being Spoken in Ancestrally Diverse WA Counties',
                 labels=dict(diversity_ratio="Ancestral Diversity", english_only="Households that Speak Only English"))
fig.show()

![](pngs/language_diversity_per_county.png)

![](pngs/household_diversity.png)

In [6]:
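import statsmodels.api as sm

# Fit an OLS model with diversity_ratio as the dependent variable
# and english_only as the single predictor
X = diverse_lang['english_only']
Y = diverse_lang['diversity_ratio']
X = sm.add_constant(X)  # add the intercept column
m = sm.OLS(Y.astype(float), X.astype(float))
r = m.fit()
r.summary()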

| | | | |
|---|---|---|---|
| Dep. Variable: | diversity_ratio | R-squared: | 0.825 |
| Model: | OLS | Adj. R-squared: | 0.821 |
| Method: | Least Squares | F-statistic: | 179.6 |
| Date: | Tue, 12 Mar 2024 | Prob (F-statistic): | 5.6e-16 |
| Time: | 09:01:11 | Log-Likelihood: | 97.342 |
| No. Observations: | 40 | AIC: | -190.7 |
| Df Residuals: | 38 | BIC: | -187.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0923 | 0.024 | 3.844 | 0.000 | 0.044 | 0.141 |
| english_only | 0.3741 | 0.028 | 13.402 | 0.000 | 0.318 | 0.431 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.26 | Durbin-Watson: | 1.763 |
| Prob(Omnibus): | 0.533 | Jarque-Bera (JB): | 1.089 |
| Skew: | -0.205 | Prob(JB): | 0.58 |
| Kurtosis: | 2.304 | Cond. No. | 14.0 |
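
##### &emsp; The summary above lists `diversity_ratio` as the dependent variable, while the discussion treats the English Only rate as the predicted variable. With a single predictor, R-squared is the same in either direction, so the 0.825 figure applies both ways. The sketch below checks this on made-up toy numbers (not project data) and also confirms that the reported t-statistic, F-statistic, and R-squared are mutually consistent. The helper `fit_r_squared` is a hypothetical name introduced only for this illustration.

```python
# Values copied from the summary table above.
t_slope = 13.402
f_stat = 179.6
n_obs = 40

# With one predictor, the F-statistic equals the square of the slope's t-statistic.
f_from_t = t_slope ** 2

# R-squared can be recovered from F and the residual degrees of freedom.
df_resid = n_obs - 2
r2_from_f = f_stat / (f_stat + df_resid)

print(round(f_from_t, 1))   # 179.6
print(round(r2_from_f, 3))  # 0.825


def fit_r_squared(x, y):
    """R-squared of a least-squares line predicting y from x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
toy_x = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_y = [0.62, 0.70, 0.69, 0.81, 0.88]

r2_y_on_x = fit_r_squared(toy_x, toy_y)
r2_x_on_y = fit_r_squared(toy_y, toy_x)
print(abs(r2_y_on_x - r2_x_on_y) < 1e-9)  # True
```

##### &emsp; The slope and intercept do change when the two variables are swapped, so the coefficients in the table describe `diversity_ratio` as a function of `english_only` only.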


***
## **Impact and Limitations**
***
##### &emsp; Our greatest source of information, and our greatest limitation, was our reliance on the US Census Bureau for data. Because 4 of the 5 datasets we analyzed came from this single, albeit large, source, our analysis naturally inherits its biases and limitations. As mentioned in the Data Setting section, census data is prone to response bias, and because it deals with demographic information, responders are arguably most susceptible to social desirability bias. One implication is that responders from cultures and demographics that are less recognized in the United States may not have felt as inclined to declare their ancestry and language as those in communities that are actively celebrated. Especially as we look at ancestry in Washington state, we have to recognize that some communities, such as African Americans and Indigenous Americans, had their lineage systemically erased from archives and history, so responders may not be able to report their full ancestral background. We also noticed the lack of AAPI ancestry in our data. With the dataset’s notes stating that “this table lists only the largest ancestry groups…” and that it only recorded the first and second ancestry, we questioned how the US Census defined “largest” and which groups made that cutoff. While the notes also pointed to “more detailed tables” for additional ancestry groups, we agreed that this was an example of erasure, as the table initially conveyed that there were no people of AAPI ancestry in Washington state. 
##### &emsp; As for our graphs, we recognize that some axes are skewed to give a closer look at the visualizations, which can unintentionally lead to misinformed opinions and understandings. While we did not see a correlation between per pupil expenditures and the languages spoken in each county, it is apparent that there is a high proportion of Spanish speakers in comparison to the other non-English languages. This is also true for the visualizations showing educational attainment by language and, although it is a loose trend, the counties with the most Spanish speakers seemed to have the lowest educational attainment. We found these trends interesting, as we first thought that language spoken did not have any influence on the ways counties are supported and uplifted; however, we urge viewers to think critically about other socioeconomic factors that are popular indicators of wellbeing (e.g., race, income) and that we failed to highlight in our project. We also acknowledge that we were only able to analyze one year’s worth of data, which limits the reliability of our project. If we had been able to invest more time into this project, we believe we could have produced more reliable and valid results.  


***
## **Challenge Goals**
***

### **Multiple Datasets:**
##### &emsp; As mentioned in the Data Setting section, we used multiple datasets from the US Census in order to compare trends across Washington counties, using location, the Washington county, as our primary key. For example, to answer our third research question, we took data about the spoken language diversity of Washington counties and data about county-specific ancestral backgrounds and merged them together with the county as the common key. By using multiple datasets, we were able to relate more kinds of data to one another and increase the diversity of the data we analyzed.
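
##### &emsp; A county-keyed merge of this kind might look like the sketch below. The column names `english_only` and `diversity_ratio` appear in this report, but the county names and values here are hypothetical, and the `validate` and `indicator` checks are one possible way to confirm the join, not necessarily what the project used.

```python
import pandas as pd

# Hypothetical county-level tables; the values are illustrative only.
counties = pd.Index(["County X", "County Y", "County Z"], name="County")
language = pd.DataFrame({"english_only": [0.80, 0.65, 0.90]}, index=counties)
ancestry = pd.DataFrame({"diversity_ratio": [0.40, 0.33, 0.45]}, index=counties)

# Merge on the county index. validate raises an error if a county appears
# more than once on either side; indicator adds a _merge column recording
# which table(s) each row came from.
combined = language.merge(
    ancestry,
    left_index=True,
    right_index=True,
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Rows present in only one table would signal a county name mismatch.
unmatched = combined[combined["_merge"] != "both"]
print(len(unmatched))
```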
### **New Libraries:** 

##### &emsp; We mainly used Plotly Express and Plotly Graph Objects from the Plotly library to create interactive visualizations that support the communication of our analyses. We believe that visualizations with some form of interactivity encourage more direct attention and interpretation from the viewer. We especially chose to lean into Plotly's hover label feature, which lets the viewer hover over a data point to see its x and y values without having to reference the axes. Because we wanted to communicate relationships between features, we found that stacking the plots on top of one another created a sense of relativity, with the hover labels highlighting the more granular information and allowing the data to be conveyed more transparently.

### **Messy Data**:
##### &emsp; We initially did not believe that we would have to deal with messy data, as at first glance our datasets seemed to be stored neatly. We quickly realized that the data from the US Census’s robust interactive interface did not translate as well when read as CSVs for analysis. Multi-indexes, column sections, and headings were particularly messy, as we had to manually separate them with regex patterns, pivot the tables, and rename the columns to remove non-breaking spaces and extra whitespace. As we were also using datasets from collectors outside of the US Census, we had to normalize the county and district names to make sure joins and merges of the separate tables were performed correctly without data loss.
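
##### &emsp; As a rough illustration of this kind of cleanup, the sketch below normalizes header labels and county names with regex patterns. Both helpers (`clean_label`, `clean_county`) are hypothetical and are not the project's cleaning functions; in particular, stripping the "County" and ", Washington" suffixes is an assumption about how the names might be normalized.

```python
import re


def clean_label(label: str) -> str:
    """Turn a raw header into a lowercase, underscore-separated name."""
    text = label.replace("\u00a0", " ")        # non-breaking space -> regular space
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace runs, trim the ends
    return text.lower().replace(" ", "_")


def clean_county(name: str) -> str:
    """Reduce a raw geography label to a bare county name."""
    text = name.replace("\u00a0", " ")
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r",\s*Washington$", "", text)  # drop a trailing state name
    text = re.sub(r"\s+County$", "", text)       # drop a trailing "County"
    return text


print(clean_label("\u00a0\u00a0College, undergraduate "))  # college,_undergraduate
print(clean_county("King County, Washington"))             # King
```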
 
### **Result Validity**:
##### &emsp; After looking at the results of our first visualization for our third research question, we decided that performing a linear regression on the two variables, ancestral diversity and the rate of households that spoke English exclusively, could produce a meaningful output. Although not as major as the other challenge goals, we felt that including statistical analysis to answer this question helped strengthen our overall understanding of the relationships around language diversity within Washington state. 


***
## **Plan Evaluation**
***
### **Time Utilized:**
##### &emsp; Our proposed work plan time estimates were relatively accurate. Our estimates were close to reality for the most part, as we were able to judge based on previous experience working with datasets and visualizations in CSE 163 class assignments. The two areas that took longer than anticipated were cleaning the data and creating the visualizations. Since we initially did not anticipate having to clean the data much, we estimated a lower amount of time for this step. However, as mentioned previously, the datasets ended up being quite messy and required a lengthy amount of cleaning code and time. Creating the visualizations took longer than expected because we had little experience with Plotly, which was a new library compared to the one(s) used during class assignments. Other than these two areas, we were able to follow our proposed plan fairly accurately and spent a fair amount of time on each step of this research project.

### **Developing Code:**
##### &emsp; For each high-level section, the work was divided in half. Before each member went off to work independently, we held conversations to discuss the scope and define what each half of the workload would look like. For each research question, instead of directly using pandas and Python to wrangle the data, we separated the code into chunks to establish a basis for testing and validation later on. 

### **Testing Code:**
##### &emsp; For each research question, as well as the cleaning functions, the member responsible for the majority of the code worked on the corresponding test cases. This allowed for the person most familiar with the code and thought process to curate the test cases for them.

### **Coordinating Work:**
##### &emsp; Throughout this project, all members discussed with one another before and after starting each high-level section. All members notified the other team member about potential challenges or conflicts as soon as possible, and constant communication was encouraged to make sure no member felt unsupported.


***
## **Testing**
***
##### &emsp; We tested our code using a variety of assertions, doctests, and statistical regression evaluations. The majority of testing was done using assert statements. We compared the data values contained in each graph with the values held in the dataframe corresponding to that graph. This helped ensure that no values were lost or added during graph creation. For the cleaning functions, we used doctests and a smaller testing data file. The doctests can be found within the docstring inside each cleaning function located in the Results and Code section of this report, and the smaller data file import can be found at the top of the Results and Code Importing and Cleaning section code cell. We know that our code computes the expected result because our assertion tests include statements that check sorted groupings of values. This not only ensures that all values we want to include in the graph are included, but also guarantees that the values are presented in the correct order or manner. By comparing sorted values, we can make sure that the highest and lowest values are in the right place relative to each other. This is especially important because many of our visualizations rely on sorted values to make our conclusions and takeaways clear and concise. Additionally, to check the goodness of fit of the linear model used in the last research question, we computed a mean squared error of about 0.2, which is fairly close to 0 and which we took to indicate that the linear model was a reasonable fit for the data. The code for all of our testing (excluding the doctests found in each cleaning function) can be found below.

##### Assert statements for RQ1 code:

In [None]:
# Undergraduate data
assert sorted(fig.to_dict()['data'][0]['y']) == sorted(sub_enrollment['college,_undergraduate'].tolist()), "Undergrad data does not match expected"
# Postgrad data
assert sorted(fig.to_dict()['data'][1]['y']) == sorted(sub_enrollment['graduate,_professional_school'].tolist()), "Postgrad data does not match expected"
# Spanish data
assert sorted(fig.to_dict()['data'][2]['y']) == sorted(sub_language['spanish'].tolist()), "Spanish data does not match expected"
# Indo-Euro data
assert sorted(fig.to_dict()['data'][5]['y']) == sorted(sub_language['other_indo-european_languages'].tolist()), "Indo-European data does not match expected"
# Asian data
assert sorted(fig.to_dict()['data'][8]['y']) == sorted(sub_language['asian_and_pacific_island_languages'].tolist()), "Asian & Pacific Islander data does not match expected"
# Other data
assert sorted(fig.to_dict()['data'][11]['y']) == sorted(sub_language['other_languages'].tolist()), "Other language data does not match expected"
# English data
assert sorted(fig.to_dict()['data'][14]['y']) == sorted(sub_language['speak_only_english'].tolist()), "English data does not match expected"
# Other than English data
assert sorted(fig.to_dict()['data'][17]['y']) == sorted(sub_language['speak_a_language_other_than_english'].tolist()), "Other than English data does not match expected"
# Counties
assert sorted(fig.to_dict()['data'][0]['x']) == sorted(enrollment.index.get_level_values('County')), "Counties not accounted for"

##### Assert statements for RQ2 code:

In [None]:
# Testing for all counties included and are sorted by the PPE
ppe_languages_sorted = ppe_languages.sort_values(by='ppe', ascending=False).index
for i in range(len(fig.to_dict()['data'])):
    all_labels = fig.to_dict()['data'][i]['x']
    for label, idx in zip(all_labels, ppe_languages_sorted):
        assert(label.split(' ', 1)[1] == idx), "Data does not match expected"

##### Assert statements and statistical analysis evaluations for RQ3 code:

In [None]:
# Testing for all data points to be on the plot
assert sorted(fig.to_dict()['data'][0]['y']) == sorted(diverse_lang['english_only'].tolist()), "Data points do not match expected"

In [7]:
# The mean squared error for checking the goodness of fit of the linear model
sm.tools.eval_measures.mse(diverse_lang['english_only'], diverse_lang['diversity_ratio'], axis=0)

0.20066900124783196
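
##### &emsp; The call above passes the two raw columns to `mse`, so the result measures the average squared gap between `english_only` and `diversity_ratio` themselves. A model-based check would instead compare the observed values against the fitted values of the regression (in statsmodels these are available as `r.fittedvalues`). The sketch below shows the difference on made-up toy numbers; `raw_mse` and `fitted_mse` are hypothetical helper names, not project code.

```python
def raw_mse(x, y):
    """Mean squared difference between two sequences, with no model involved."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def fitted_mse(x, y):
    """Mean squared error between y and the least-squares line fitted on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]
    return sum((b - f) ** 2 for b, f in zip(y, fitted)) / n


# Toy values for illustration only.
toy_x = [0.1, 0.2, 0.3, 0.4]
toy_y = [0.62, 0.70, 0.69, 0.81]

column_gap = raw_mse(toy_x, toy_y)    # gap between the two columns
model_mse = fitted_mse(toy_x, toy_y)  # error of the fitted line

print(model_mse <= column_gap)  # True
```

##### &emsp; The fitted-line error can never exceed the raw column gap, because the least-squares line minimizes squared error over all lines, including the line y = x that the raw comparison implicitly uses.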

***
## **Collaboration**
*** 
##### &emsp; In the process of this project, we consulted the [Plotly documentation](https://plotly.com/python/) for guidance on using the Plotly library, University of Washington's Professor Ott Toomet's textbooks ([1](https://faculty.washington.edu/otoomet/machinelearning-py/linear-regression.html), [2](https://faculty.washington.edu/otoomet/machineLearning.pdf)) on Python for Machine Learning for linear regression and general statistical analysis and interpretation, and [Stack Overflow](https://stackoverflow.com/) for advice on debugging code. We have not used generative AI in any way for this project.