# Assignment 2

## Instructions - Read this first!

This is an individual homework assignment. This means that:

- You may discuss the problems in this assignment with other students in this course and your instructor/TA, but YOUR WORK MUST BE YOUR OWN.
- Do not show other students code or your own work on this assignment.
- You may consult external references, but not actively receive help from individuals not involved in this course.
- Cite all references outside of the course you used, including conversations with other students which were helpful. (This helps us give credit where it is due!). All references must use a commonly accepted reference format, for example, APA or IEEE (or another citation style of your choice).

If any of these rules seem ambiguous, please check with your instructor for help interpreting them.

We suggest completing this assignment using the provided notebook. Each question should be answered using a SQL query (or a combination of SQL queries) unless the text indicates that you may do something else. You may submit your queries embedded in Python, using SQLAlchemy or the MySQL Connector, or as plain text in Markdown.

## When you submit your work

Your submission will be graded manually. To ensure that everything goes smoothly, please follow these instructions to prepare your notebook for submission to the D2L Dropbox for Assignment 2:

- Please remove any print statements used to test your work (you can comment them out).
- Please provide your solutions where asked; please do not alter any other parts of this notebook.
- If you need to add cells to test your code, please move them to the end of the notebook before submission, or include your commented-out answers and tests in the cells provided.

## Introduction

In this assignment, we will focus on familiarizing you with using SQL for data exploration, and on continuing to cultivate a sense of curiosity about the datasets you encounter. We will be using a table generated by Statistics Canada (2022), based upon census data, about the knowledge of languages by area in Canada. This table has been pre-cleaned (although potentially not entirely) so that you can work on the actual assignment more quickly.

To begin, start by importing the provided CSV into your own SQL database using SQLAlchemy, by filling in the lines below:

In [1]:
import pandas as pd
import sqlalchemy as sq

# read in your CSV as a dataframe
speakers = pd.read_csv("language_speakers.csv")
metadata = pd.read_csv("MetaData.csv")
# connect to your database; include a cell at the bottom of this notebook to dispose of your engine object
user = 'andrii_voitkiv1'
password = '3RMB68NSY'
host = 'datasciencedb.ucalgary.ca'
database = 'andrii_voitkiv1'

engine = sq.create_engine('mysql+mysqlconnector://{0}:{1}@{2}/{3}'.format(user, password, host, database))
# write your dataframe into a table
speakers.to_sql(name='speakers', con=engine, if_exists='replace')
metadata.to_sql(name='metadata', con=engine, if_exists='replace')
# demonstrate that your import has been successful by reading your database table as a dataframe,
# and print some information (not the entire table) about your second dataframe
df_speakers = pd.read_sql_table("speakers", engine)
df_metadata = pd.read_sql_table("metadata", engine)
print(df_speakers.head())
print(df_metadata.head())

   index  REF_DATE     GEO           DGUID Statistics (3)  \
0      0      2021  Canada  2021A000011124          Count   
1      1      2021  Canada  2021A000011124          Count   
2      2      2021  Canada  2021A000011124          Count   
3      3      2021  Canada  2021A000011124          Count   
4      4      2021  Canada  2021A000011124          Count   

     Knowledge of languages (324)  \
0  Total - Knowledge of languages   
1  Total - Knowledge of languages   
2  Total - Knowledge of languages   
3              Official languages   
4              Official languages   

  Single and multiple responses of knowledge of languages (3) SCALAR_FACTOR  \
0  Total - Single and multiple responses of knowl...                  units   
1         Single responses of knowledge of languages                  units   
2       Multiple responses of knowledge of languages                  units   
3  Total - Single and multiple responses of knowl...                  units   
4         Singl

## Part A: Warm-up Questions (12 marks)

Answer the questions below, including the queries you used where necessary. Not all questions will require writing a SQL query to answer.

**(1 mark)**

How many records are there in total?

In [71]:
query0 = '''
SELECT COUNT(*) FROM speakers;
'''

query_table0 = pd.read_sql_query(query0, engine)
print(query_table0)

# verify with python
# len(df_speakers)

   COUNT(*)
0      7536


**(1 mark)**

How many different languages are spoken in Canada?

To answer this question I studied the data and found that the column `Knowledge of languages (324)` has a parent-child hierarchy: some entries are groups of languages rather than individual languages. For example, the Celtic languages group consists of Irish, Scottish Gaelic, Welsh and Celtic languages, n.i.e. I don't want to count **parent languages** (group names) as _unique_.

At first glance there is a pattern: a language attribute that ends with `%languages` is the name of a parent group. SPOILER: that's not always the case!
On the same web page as the data source I found metadata (link at the bottom of this cell) that assigns a code to each language and to its parent.

**Methodology:**
To count as _unique_, a language **must not be a parent** of any other language. Technically, this means its code in the `Member ID` column must not appear anywhere in the `Parent Member ID` column (the code of each language's parent).
How I implemented this:
1. Downloaded the metadata from the link at the bottom of this cell;
2. Cleaned it in Excel, removing all columns except `Member Name` (language), `Member ID` and `Parent Member ID`;
3. Read the resulting CSV file with pandas (variable `df_metadata` in this notebook);
4. Loaded it into the database with SQLAlchemy;
5. Created a VIEW that joins the main dataset with the metadata and adds a conditional column flagging unique languages:
    1. Renamed some columns of the main dataset to save space;
    2. Joined the two datasets on the language name;
    3. Calculated a column, `Unique_flag`, that is 1 when the language's `Member ID` is not a parent of any other language (so the language is unique) and 0 otherwise.

**Result:**
The two approaches (1. filtering out names that match `%languages`; 2. checking the metadata) differ by only one language: `Hmong-Mien languages`. It is a unique language because it _has no child languages_, which the metadata confirms. Removing it with a text filter would therefore be a **mistake**.
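
The difference between the two approaches can be illustrated on a toy hierarchy. The sketch below is not the notebook's MySQL view: it assumes an in-memory SQLite database and a hypothetical five-row `metadata` table with simplified column names (`member_id`, `member_name`, `parent_member_id`). It also uses `NOT EXISTS` instead of `NOT IN`, an assumption made so that the missing parent of the root row cannot influence the comparison.

```python
import sqlite3

# Hypothetical toy hierarchy; the real metadata has many more rows.
toy_rows = [
    (1, "Total - Knowledge of languages", None),
    (2, "Celtic languages", 1),
    (3, "Irish", 2),
    (4, "Welsh", 2),
    (5, "Hmong-Mien languages", 1),  # name looks like a group, but it has no children
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (member_id INTEGER, member_name TEXT, parent_member_id INTEGER)"
)
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", toy_rows)

# Approach 2: a language is unique when no other row names it as its parent.
leaf_query = """
SELECT m.member_name
FROM metadata AS m
WHERE NOT EXISTS (SELECT 1 FROM metadata AS p WHERE p.parent_member_id = m.member_id)
ORDER BY m.member_id;
"""
leaves_by_metadata = [name for (name,) in conn.execute(leaf_query)]

# Approach 1: drop every name that ends with "languages".
leaves_by_name = [name for (_, name, _) in toy_rows if not name.endswith("languages")]

print(leaves_by_metadata)
print(leaves_by_name)
conn.close()
```

On this toy data the name filter drops `Hmong-Mien languages`, while the metadata check keeps it.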

**Reference:**
Metadata - https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadCubeMetaData-nonTraduit.action?pid=9810021701&csvLocale=en

In [3]:
# Create the VIEW described above
query_view = '''
CREATE OR REPLACE VIEW speakers_full AS
SELECT s.GEO, s.`Knowledge of languages (324)` AS Languages,
       s.`Single and multiple responses of knowledge of languages (3)` AS Responses,
       s.VALUE,
       m.`Member ID`, m.`Parent Member ID`,
       (CASE WHEN (m.`Member ID` NOT IN (SELECT DISTINCT metadata.`Parent Member ID` FROM metadata))
         THEN 1
         ELSE 0 END) AS Unique_flag
FROM speakers s
LEFT OUTER JOIN metadata m
    ON s.`Knowledge of languages (324)` = m.`Member Name`;
'''
# run the statement so that the view exists before it is queried below
engine.execute(query_view)

query_select_view = '''
SELECT * FROM speakers_full LIMIT 10;
'''
query_table_view = pd.read_sql_query(query_select_view, engine)
print(query_table_view)

                         GEO                       Languages  \
0                     Canada  Total - Knowledge of languages   
1                     Canada  Total - Knowledge of languages   
2                     Canada  Total - Knowledge of languages   
3  Newfoundland and Labrador  Total - Knowledge of languages   
4  Newfoundland and Labrador  Total - Knowledge of languages   
5  Newfoundland and Labrador  Total - Knowledge of languages   
6       Prince Edward Island  Total - Knowledge of languages   
7       Prince Edward Island  Total - Knowledge of languages   
8       Prince Edward Island  Total - Knowledge of languages   
9                Nova Scotia  Total - Knowledge of languages   

                                           Responses     VALUE  Member ID  \
0  Total - Single and multiple responses of knowl...  36328480          1   
1         Single responses of knowledge of languages  21377910          1   
2       Multiple responses of knowledge of languages  14950565  

In [77]:
# Count unique languages
query1 = '''
SELECT SUM(Unique_flag) AS unique_languages
FROM speakers_full
WHERE GEO="Canada" AND Responses = "Total - Single and multiple responses of knowledge of languages";
'''
query_table1 = pd.read_sql_query(query1, engine)
print(query_table1)

   unique_languages
0             264.0


There are 264 unique languages spoken in Canada.

**(1 mark)**

Is there data for all languages, for each province or territory?

In [80]:
query2 = '''
SELECT GEO, COUNT(Languages) AS unique_languages
FROM
    (SELECT GEO, Languages
     FROM speakers_full
     WHERE Responses = 'Multiple responses of knowledge of languages'
       AND GEO <> "Canada"
       AND Unique_flag = 1) a
GROUP BY GEO ORDER BY unique_languages DESC ;
'''
query_table2 = pd.read_sql_query(query2, engine)
print(query_table2)

                          GEO  unique_languages
0            British Columbia               253
1                     Ontario               235
2                     Alberta               235
3                      Quebec               219
4                    Manitoba               203
5                Saskatchewan               198
6                 Nova Scotia               167
7               New Brunswick               155
8   Newfoundland and Labrador               124
9       Northwest Territories               100
10       Prince Edward Island                99
11                      Yukon                91
12                    Nunavut                60


There are 264 unique languages spoken in Canada, and no province or territory has data for all of them. British Columbia has the widest variety (253 languages) and Nunavut the smallest (60).

**(8 marks)**

Explain what each of the following columns is used for. You may use the original page to guide your explanation (but you should cite it)
* GEO
* Statistics
* Knowledge of languages
* Single and multiple responses of knowledge of languages (2 marks)
* COORDINATE (2 marks)
* VALUE

* GEO - (text) the geographic area the count refers to: Canada as a whole, or the province or territory where respondents were polled;
* Statistics - (text) the type of statistic reported; in our case, a count of respondents' answers to a given question;
* Knowledge of languages - (text) a language in which the person can conduct a conversation;
* Single and multiple responses of knowledge of languages - (text) whether a respondent reported one language only (single response) or two or more languages (multiple responses);
* COORDINATE - a numerical label that encodes the values of the coded columns. The first number represents the province, the last number represents the response type (total, single, multiple) and the second-to-last represents the language;
* VALUE - (bigint) the number reported for the corresponding statistic.
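
To make the COORDINATE description concrete, here is a minimal sketch that splits a coordinate into its parts. The four-part layout and the position of the Statistics dimension are assumptions inferred from the column order of the table, and the sample value `"1.1.4.1"` is hypothetical rather than copied from the dataset.

```python
from typing import Dict

# Assumed dimension order (follows the column order of the table).
DIMENSIONS = (
    "GEO",
    "Statistics",
    "Knowledge of languages",
    "Single and multiple responses",
)


def parse_coordinate(coordinate: str) -> Dict[str, int]:
    """Split a dot-separated coordinate into one numeric code per dimension."""
    parts = [int(part) for part in coordinate.split(".")]
    if len(parts) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} parts, got {len(parts)}")
    return dict(zip(DIMENSIONS, parts))


example = parse_coordinate("1.1.4.1")  # hypothetical coordinate value
print(example)
```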

**(1 mark)**

Is it possible to tell how many individuals live in Canada based on this dataset? Why or why not?

It is impossible to tell how many individuals live in Canada from this dataset, because some respondents gave multiple responses (two or more answers) and we cannot reverse-engineer how many distinct people are behind them.

## Part B: Simple questions (10 marks) 

For these queries, run a query which provides the answer.

**(2 marks)**

Which province or territory has the highest number of individuals who can speak multiple languages?

In [4]:
query3 = '''
SELECT b.GEO AS highest_indiv_multi_lang_prov
FROM
    (SELECT a.GEO, SUM(a.VALUE) AS multi_speakers
     FROM
         (SELECT GEO, VALUE
          FROM speakers_full
          WHERE Responses = 'Multiple responses of knowledge of languages'
            AND GEO <> "Canada"
            AND Unique_flag = 1) a
     GROUP BY a.GEO) AS b
-- sort in the outer query: row order inside a derived table is not guaranteed
ORDER BY b.multi_speakers DESC
LIMIT 1;
'''

query_table3 = pd.read_sql_query(query3, engine)
print(query_table3)

  highest_indiv_multi_lang_prov
0                       Ontario


**(2 marks)**

Which provinces or territories have the most diverse number of languages spoken?

In [5]:
query4 = '''
SELECT b.GEO, b.unique_languages AS unique_languages_count
FROM
    (SELECT a.GEO, COUNT(a.Languages) AS unique_languages
     FROM
         (SELECT GEO, Languages
          FROM speakers_full
          WHERE Responses = 'Multiple responses of knowledge of languages'
            AND GEO <> "Canada"
            AND Unique_flag = 1) a
     GROUP BY a.GEO) AS b
-- sort in the outer query: row order inside a derived table is not guaranteed
ORDER BY b.unique_languages DESC;
'''

query_table4 = pd.read_sql_query(query4, engine)
print(query_table4)

                          GEO  unique_languages_count
0            British Columbia                     253
1                     Ontario                     235
2                     Alberta                     235
3                      Quebec                     219
4                    Manitoba                     203
5                Saskatchewan                     198
6                 Nova Scotia                     167
7               New Brunswick                     155
8   Newfoundland and Labrador                     124
9       Northwest Territories                     100
10       Prince Edward Island                      99
11                      Yukon                      91
12                    Nunavut                      60


The five provinces with the most diverse languages are British Columbia, Ontario, Alberta, Quebec and Manitoba.

**(2 marks)**

Which provinces or territories have the most diverse number of Indigenous languages spoken?

In [88]:
query5 = '''
SELECT GEO, COUNT(Languages) AS indigenous_languages
FROM speakers_full
WHERE `Member ID` IN
      (SELECT DISTINCT `Member ID`
       FROM speakers_full
       WHERE `Member ID` BETWEEN 6 AND 96 AND Unique_flag = 1)
  AND Responses = 'Multiple responses of knowledge of languages'
  AND GEO <> 'Canada'
GROUP BY GEO
ORDER BY indigenous_languages
DESC;
'''

query5_table = pd.read_sql_query(query5, engine)
print(query5_table)

                          GEO  indigenous_languages
0            British Columbia                    62
1                     Alberta                    46
2                     Ontario                    43
3                      Quebec                    31
4                    Manitoba                    21
5                Saskatchewan                    21
6       Northwest Territories                    18
7                       Yukon                    16
8                 Nova Scotia                    13
9               New Brunswick                    11
10  Newfoundland and Labrador                     9
11                    Nunavut                     6
12       Prince Edward Island                     2


**(2 marks)**

As a percentage of all language speakers in each province or territory, how commonly is French spoken?

In [91]:
query6 = '''
SELECT A.GEO, A.Total, B.French, B.French / A.Total AS Proportion
FROM
-- total speakers per province: pick the 'Total' language row explicitly
( SELECT GEO, VALUE AS Total
FROM speakers_full
WHERE Responses = 'Total - Single and multiple responses of knowledge of languages'
  AND Languages = 'Total - Knowledge of languages'
  AND GEO <> "Canada" ) AS A
LEFT OUTER JOIN
-- French speakers per province
( SELECT GEO, VALUE AS French
FROM speakers_full
WHERE Responses = 'Total - Single and multiple responses of knowledge of languages'
  AND Languages = 'French'
  AND GEO <> "Canada" ) AS B ON B.GEO = A.GEO
ORDER BY Proportion DESC ;
'''

query6_table = pd.read_sql_query(query6, engine)
print(query6_table)

                          GEO     Total   French  Proportion
0                      Quebec   8308480  7786740      0.9372
1               New Brunswick    759195   317825      0.4186
2                       Yukon     39590     5740      0.1450
3        Prince Edward Island    150480    19445      0.1292
4                     Ontario  14031750  1550545      0.1105
5       Northwest Territories     40380     4445      0.1101
6                 Nova Scotia    955855    99300      0.1039
7                    Manitoba   1307190   111790      0.0855
8            British Columbia   4915945   327350      0.0666
9                     Alberta   4177715   260415      0.0623
10  Newfoundland and Labrador    502100    26130      0.0520
11               Saskatchewan   1103205    52065      0.0472
12                    Nunavut     36605     1455      0.0397
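
The question asks for a percentage, while the `Proportion` column above is a fraction. As a small worked example, the sketch below converts two rows of the output into percentages; the totals and French counts are copied from the table above, and the helper name `french_percentage` is only illustrative.

```python
# (total speakers, French speakers) copied from the query output above
rows = {
    "Quebec": (8308480, 7786740),
    "New Brunswick": (759195, 317825),
}


def french_percentage(total: int, french: int) -> float:
    """Share of French speakers among all language speakers, in percent."""
    return round(100 * french / total, 2)


percentages = {geo: french_percentage(total, french) for geo, (total, french) in rows.items()}
print(percentages)  # {'Quebec': 93.72, 'New Brunswick': 41.86}
```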


**(2 marks)**

As a percentage of all Chinese speakers in the country, how many Chinese speakers do not speak Mandarin?

In [92]:
query7 = '''
SELECT A.mandarin,
       B.chinese,
       (B.chinese - A.mandarin) / B.chinese AS non_mandarin_prop
FROM
    ( SELECT VALUE as mandarin
      FROM speakers_full
      WHERE Responses = 'Total - Single and multiple responses of knowledge of languages'
        AND GEO = "Canada"
        AND Languages LIKE 'Mandarin' ) AS A,
    (SELECT VALUE as chinese
     FROM speakers_full
     WHERE Responses = 'Total - Single and multiple responses of knowledge of languages'
       AND GEO = "Canada"
       AND Languages LIKE 'Chinese languages' ) AS B;
'''

query7_table = pd.read_sql_query(query7, engine)
print(query7_table)

   mandarin  chinese  non_mandarin_prop
0    987300  1528860             0.3542


## Part C: Detailed analysis (20 marks)

Now consider being given a task to make sense of the distribution of languages spoken in Canada using this dataset.

**Question 1 (4 marks)**

Create two guiding questions to use in your analysis, and include them below as Markdown. As a starting point (and remember you are not limited to only these!), you may want to consider the following ideas:
- focus on a specific set of languages which are interesting to you
- consider the effect of language speakers who speak only one or multiple languages
- differences between specific provinces and territories in terms of patterns of language speakers

1. What is the most diverse language group? In other words, which **parent group** has the widest variety of unique languages?
2. What is the effect of people who can speak only one language, their mother tongue?

**Question 2 (12 marks)** 

Write at least four queries (two for each question) which you believe will address one of your guiding questions. Clearly indicate which queries address your questions. You may wish to include a comment to explain why this query will help address your question.

### Guiding question #1:
####  What is the most diverse language group? In other words, which **parent group** has the widest variety of unique languages?

First, let's find the ID of the group with the greatest variety. That group is the parent of the most other languages, so its ID appears most often in `Parent Member ID`.

In [10]:
query8 = '''
SELECT b.Languages, b.`Member ID`, a.diversity
FROM
    (SELECT Languages, `Member ID`
     FROM speakers_full
     WHERE Responses = 'Total - Single and multiple responses of knowledge of languages'
       AND GEO = "Canada") AS b
        INNER JOIN
        -- number of direct children of every parent ID
        (SELECT `Parent Member ID`, COUNT(`Parent Member ID`) AS diversity
         FROM metadata
         GROUP BY `Parent Member ID`) AS a
            ON a.`Parent Member ID` = b.`Member ID`
ORDER BY a.diversity DESC
LIMIT 5;
'''

query8_table = pd.read_sql_query(query8, engine)
print(query8_table)

                  Languages  Member ID  diversity
0     Niger-Congo languages        256         24
1  Non-Indigenous languages         97         19
2    Austronesian languages        128         16
3      Indo-Aryan languages        215         16
4      Indigenous languages          6         13


Now I know that the Niger-Congo languages group (`Member ID` 256) is the most diverse, with 24 languages. Next, I'd like to know whether it's _truly_ diverse, meaning that each language accounts for a roughly equal proportion of speakers.

In [13]:
query9 = "SET @total_nigerokongo_speakers := " \
         "(SELECT SUM(a.african_speakers) " \
         "FROM (SELECT VALUE as african_speakers FROM speakers_full WHERE Responses = 'Total - Single and multiple responses of knowledge of languages' AND GEO = 'Canada' AND `Parent Member ID` = 256) as a);"

query10 = '''
SELECT b.Languages,
       b.african_speakers,
       b.african_speakers / @total_nigerokongo_speakers AS proportion,
       b.cumsum / @total_nigerokongo_speakers           as cumsumprop
FROM (SELECT *, SUM(a.african_speakers) OVER (ORDER BY a.african_speakers DESC) AS cumsum
      FROM (SELECT Languages, VALUE as african_speakers
            FROM speakers_full
            WHERE Responses = 'Total - Single and multiple responses of knowledge of languages'
              AND GEO = "Canada"
              AND `Parent Member ID` = 256
            ORDER BY VALUE DESC) AS a) AS b;
'''

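# user variables are scoped to a session, so set and read the variable on the same connection
with engine.connect() as conn:
    conn.execute(query9)  # store the group's total number of speakers in the variable
    query_table10 = pd.read_sql_query(query10, conn)
print(query_table10)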

                          Languages  african_speakers  proportion  cumsumprop
0                           Swahili             57290      0.1811      0.1811
1                            Yoruba             50935      0.1610      0.3421
2     Niger-Congo languages, n.i.e.             43620      0.1379      0.4800
3                        Akan (Twi)             26240      0.0830      0.5630
4                           Lingala             25725      0.0813      0.6443
5                              Igbo             18365      0.0581      0.7024
6                   Rundi (Kirundi)             14825      0.0469      0.7493
7              Kinyarwanda (Rwanda)             13515      0.0427      0.7920
8                             Wolof             12945      0.0409      0.8329
9   Fulah (Pular, Pulaar, Fulfulde)              7845      0.0248      0.8577
10                            Shona              7285      0.0230      0.8807
11                              Edo              4950      0.015

As the output above shows, the top 12 languages (half of the group) account for about 90% of all speakers, while each of the remaining languages accounts for less than 2% of the Niger-Congo group's speakers.
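
The proportion and cumulative-proportion columns can also be computed in a single statement, without a session variable, by using `SUM(...) OVER ()` for the group total. The sketch below is an assumption-laden illustration rather than the notebook's query: it runs on an in-memory SQLite database (window functions need SQLite 3.25 or newer), uses a hypothetical table `group_speakers`, and loads only three rows from the output above, so its proportions differ from those of the full 24-language group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE group_speakers (language TEXT, speakers INTEGER)")
# Three rows taken from the output above; the real group has 24 languages.
conn.executemany(
    "INSERT INTO group_speakers VALUES (?, ?)",
    [("Swahili", 57290), ("Yoruba", 50935), ("Lingala", 25725)],
)

query = """
SELECT language,
       speakers,
       1.0 * speakers / SUM(speakers) OVER () AS proportion,
       1.0 * SUM(speakers) OVER (ORDER BY speakers DESC)
           / SUM(speakers) OVER ()            AS cumsumprop
FROM group_speakers
ORDER BY speakers DESC;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

An empty `OVER ()` clause makes the sum span every row of the result, so it plays the role of the stored total, while the ordered window produces the running sum.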

### Guiding question #2:
#### What is the effect of people who can speak only one language, their mother tongue?

1. I assume that a single response means the language is the respondent's _mother tongue_.
2. Again, _multiple responses_ mean the respondent can communicate in two or more languages.
3. There are two official languages: English and French.
4. When a respondent gave multiple responses that include a non-official language, it is highly probable that the non-official language is their mother tongue, spoken in addition to other languages (for example English, or another language closely related to their own).

In [23]:
query11 = '''
SELECT a.Languages,
       a.single + b.multiple              AS total,
       a.single,
       b.multiple,
       a.single / (a.single + b.multiple) AS prop
-- restrict both sides to the national rows so each language appears exactly once
FROM (SELECT Languages, VALUE AS single
      FROM speakers_full
      WHERE Responses = 'Single responses of knowledge of languages'
        AND GEO = 'Canada'
        AND Unique_flag = 1) AS a
         LEFT OUTER JOIN
     (SELECT Languages, VALUE AS multiple
      FROM speakers_full
      WHERE Responses = 'Multiple responses of knowledge of languages'
        AND GEO = 'Canada'
        AND Unique_flag = 1) AS b ON a.Languages = b.Languages
ORDER BY prop DESC;
'''

query_table11 = pd.read_sql_query(query11, engine)
print(query_table11)

                             Languages     total    single  multiple    prop
0                              English  31628570  17218180  14410390  0.5444
1                               French  10563235   3579315   6983920  0.3388
2                             Rohingya       760       170       590  0.2237
3               Sign languages, n.i.e.      7450      1575      5875  0.2114
4                          S'gaw Karen      5030       740      4290  0.1471
..                                 ...       ...       ...       ...     ...
179                               Igbo     18365        25     18340  0.0014
180  Pampangan (Kapampangan, Pampango)     10555        15     10540  0.0014
181                              Shona      7285        10      7275  0.0014
182                         Hiligaynon     16385        20     16365  0.0012
183                            Swedish     14615        10     14605  0.0007

[184 rows x 5 columns]


From this output:
1. The official languages are at the top of the table. Take English: 54% of its speakers gave a single response, which suggests that many English speakers feel no need to learn another language.
2. The low ratios for non-official languages suggest that immigrants (with high probability) have to know at least English or French to be able to communicate. Take the last row, Swedish: there are only 10 single responses (10 Swedish speakers who speak no other language), while almost 15 thousand Swedish speakers can speak two or more languages. Treating them as people from Sweden is an assumption, as people from Norway, for example, can speak Swedish as well and may be included in this number.
3. Among the non-official languages, sign languages (which are expressed through manual articulation) have a relatively high ratio of single responses.

**Question 3 (4 marks)**

What kind of data would be interesting to receive in combination with the data in this table? Discuss how you could use this additional information to extend one of your guiding questions.

Regarding my second guiding question, it would be interesting to compare the median income of people who speak only one non-official language with that of people who can speak multiple languages. My hypothesis is that it is lower for the first group, as it's hard to find a well-paid job when you cannot communicate effectively. Even though this seems obvious, it would be interesting to see how much the median income differs.

## Part D: Reflection (5 marks)

In 100 to 250 words, identify a concept you have found difficult or confusing from this assignment. Reflect on how your previous learning or experience helped you to understand this concept. Provide your reflection using markdown in the cell below

The most challenging and time-consuming part of this assignment was making sense of the data I had to deal with. Deriving the meaning of each attribute was critical to make sure the inferences were clear.
One of the columns named `Knowledge of languages (324)` has a parent-child hierarchy, and finding and supplementing the main table with the metadata table helped to avoid some mistakes.
Regarding the SQL part, I used several "subqueries" and "joins" as well as setting variables to get the desired result. This confused me initially, as you have to split the process into a few steps and use the result of one "select" as an input for another - inside-out sequence. I often checked the result of one simple query to see if I was getting the desired result before complicating it - debugging in SQL is not straightforward.
In addition, creating new columns using transformed/aggregated values is more challenging than with the pandas library, for example. Though many things, such as filtering, selecting columns and joins, are similar.

In [None]:
# Dispose of the SQLAlchemy engine object to close its pooled connections
engine.dispose()


## References

Statistics Canada (Government of Canada). 2022. Knowledge of languages by age and gender: Canada, provinces and territories, census metropolitan areas and census agglomerations with parts. Retrieved October 15, 2022 from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021701