# Workbook #2: Univariate statistics
This workbook covers univariate statistics and a review of regression.

 <img src="https://www.institute4learning.com/blog/wp-content/uploads/2020/02/data.jpg" width=300 height=300 />

### Variables
Generally, a variable is a quantitative measure we use to represent something. This could be answers from a survey or measurements of temperature in the month of August 2021. In statistical theory, a random variable assigns a numeric value to every possible outcome, and each value occurs with some probability. Moreover, in statistical theory, there are two types of random variables: discrete and continuous.
* Discrete variables are variables that have clear boundaries between their values, like categories. For example, heads/tails and Likert scales are discrete variables. Discrete variables are usually displayed with bar graphs because they take a countable number of separate values.
* Continuous variables are variables that can take any numeric value within a range. For example, age or percent of Latinx residents are continuous variables. Continuous variables have continuous probability distributions, where there is an "infinite" number of possible values within any interval.
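
The difference shows up in how each type is usually plotted. The following is a minimal sketch, assuming the `auto` example dataset that ships with Stata (it is not part of this workbook's data). It treats `rep78` (a 1 to 5 repair rating) as discrete and `price` as continuous.

```stata
* Load the example dataset that ships with Stata (assumed for illustration only)
sysuse auto, clear

* Discrete variable: one bar per distinct value
histogram rep78, discrete frequency title("Repair record (discrete)")

* Continuous variable: values are grouped into bins
histogram price, frequency title("Price (continuous)")
```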

<img src="https://healthitanalytics.com/images/site/features/_normal/ThinkstockPhotos-645261596.jpg" width=300 height=300 />

### Variables in social statistics
In social statistics, we generally have two kinds of variables:
* Categorical variables are similar to discrete variables: the values are categories with clear boundaries. For example, race or state (Alabama, Oregon, etc...).
* Numeric variables are similar to continuous variables: the values have numeric meaning. For example, percent of Latinx residents or birthweight of child.

Categorical variables need to be handled with more care than numeric variables. For example, you cannot find the mean of a categorical variable, but you can find its mode.

You can do a lot of statistics with numeric variables because you can treat them like numbers. You can find the mean and standard deviation. Just always ask yourself: what does this number actually mean?

<div class="alert alert-block alert-warning">
How a variable can be used and interpreted in statistics depends on its type. So always think through what type of variables you are working with.</div>
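
As a rough illustration of treating the two kinds of variables differently, here is a short sketch. It assumes the `auto` example dataset that ships with Stata (not part of this workbook's data), with `foreign` standing in for a categorical variable and `mpg` for a numeric one.

```stata
* Load the example dataset that ships with Stata (assumed for illustration only)
sysuse auto, clear

* Categorical variable: a frequency table sorted by count; the first row is the mode
tabulate foreign, sort

* Numeric variable: the mean and standard deviation are meaningful
summarize mpg
display "Mean = " r(mean) "   SD = " r(sd)
```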

<img src="http://cdn.mos.cms.futurecdn.net/sdSHp2akMYc4EoZAoRE77k.jpg" width=300 height=300 />

## Variables in coding and Stata code
It is important to understand how variables are handled in coding and in Stata code. Here we will cover variable types and storage types.

<img src="https://static.semrush.com/blog/uploads/media/cd/34/cd34e2cb04a60d0d027c033e64591477/types-of-content-marketing.svg" width=300 height=300 />

### Variable types
Each coding language has its own datatypes. Generally, there are numeric types (values with actual number meaning) and string types (values made of characters and text). Keep in mind that each coding language has more specific datatypes. Let's review the Stata datatypes:
* Numeric variables -- These variables are numbers (similar to the previous definition). You can do calculations with numeric variables like the mean or standard deviation. In data view, numeric variables are displayed in black text.
* String variables -- These variables are characters or text. String values are written in double quotes. In data view, string variables are displayed in red text.

<img src="https://ophtek.com/wp-content/uploads/2018/04/data-storage.jpg" width=300 height=300 />

### Storage types
Data takes space or storage on your computer. Each coding language has different space formats. Stata's data format is:

| Storage type | Min | Max | Closest to 0 without being 0 | Bytes |
| --- | --- | --- | --- | --- |
| byte | -127 | 100 | +/-1 | 1 |
| int | -32,767 | 32,740 | +/-1 | 2 |
| long | -2,147,483,647 | 2,147,483,620 | +/-1 | 4 |
| float | -1.70141173319 x 10^38 | 1.70141173319 x 10^38 | +/-10^-38 | 4 |
| double | -8.9884656743 x 10^307 | 8.9884656743 x 10^307 | +/-10^-323 | 8 |

<i>Don't confuse integer in the numeric sense with the "int" storage type in Stata. For example, the value 10 is an integer, but it fits in the storage type "byte" in Stata.</i>
Stata also recognizes dates and times. We will cover that more in panel analysis.

String variables are stored as str1, str2, ..., str2045, and strL, where the number after "str" indicates the maximum length of the string variable.
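
To see storage types in practice, here is a small sketch on made-up data. The variable names `score` and `name` are hypothetical. It creates one numeric and one string variable, then lets `compress` pick the smallest storage type that holds the data without loss.

```stata
* Build a tiny made-up dataset (hypothetical variables, for illustration only)
clear
set obs 3

* A numeric variable deliberately stored as double (8 bytes per value)
generate double score = 10

* A string variable that can hold up to 5 characters
generate str5 name = "Ana"

* Check the storage type column
describe

* compress demotes each variable to the smallest type that loses no information
compress
describe
```

After `compress`, the storage type column in `describe` should show smaller types for both variables.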

<b>It is so important to know your variable types and storage types. Some commands only work for specific variable types and storage types.</b>

<img src="https://files.realpython.com/media/Pythonic-Data-Cleaning-With-Pandas-and-NumPy_Watermarked.0eccf29b6622.jpg" width=300 height=300 />

## Cleaning data tricks
Sometimes the data we get is messy and we have to clean it before we can even calculate a mean. I want to show you two important commands for cleaning variable types.
### ENCODE
* encode -- converts a string variable into a numeric variable. For example, let's say we have survey data with the question "Are you a smoker?" (Yes/No). Encode will create a new variable where "Yes" gets one numeric value and "No" gets another, and the original text is kept as value labels. When a variable has been "encoded", it is displayed in blue text.

In [41]:
*Let's start by setting up our workspace.
pwd

C:\Users\acade\Documents\teaching\SOC 211 spring 2022\week 2


In [10]:
*Let's change it to our "week 2" folder
cd "C:\Users\acade\Documents\teaching\SOC 211 spring 2022\week 2"

C:\Users\acade\Documents\teaching\SOC 211 spring 2022\week 2


In [2]:
*This opens the data
use https://www.stata-press.com/data/r17/hbp2, clear
*This code prints a descriptive of the dataset.
desc




Contains data from https://www.stata-press.com/data/r17/hbp2.dta
 Observations:         1,130                  
    Variables:             7                  3 Mar 2020 06:47
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
id              str10   %10s                  Record identification number
city            byte    %8.0g                 City
year            int     %8.0g                 Year
age_grp         byte    %8.0g      agefmt     Age group
race            byte    %8.0g      racefmt    Race
hbp             byte    %8.0g      yn         High blood pressure
sex             str6    %9s                   Sex
--------------------------------------------------------------------------------
Sorted by: 


In [3]:
*This prints out observations in rows 1-5.
list in 1/5


     +-----------------------------------------------------------+
     |         id   city   year   age_grp    race   hbp      sex |
     |-----------------------------------------------------------|
  1. | 8008238923      1   1993     15–19   Black    No   female |
  2. | 8007143470      1   1992     30–34       .    No          |
  3. | 8000468015      1   1988     25–29   Black    No     male |
  4. | 8006167153      1   1991     25–29   Black    No     male |
  5. | 8006142590      1   1991     20–24   Black    No   female |
     +-----------------------------------------------------------+


In [4]:
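*This is an example of using encode. gen() names the new numeric variable; label() names the value label that maps each number to its text.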
encode sex, gen(sex_numeric) label("Respondent's sex (numeric)")

In [5]:
*Let's see if there was a change
list in 1/5
desc



     +----------------------------------------------------------------------+
     |         id   city   year   age_grp    race   hbp      sex   sex_nu~c |
     |----------------------------------------------------------------------|
  1. | 8008238923      1   1993     15–19   Black    No   female     female |
  2. | 8007143470      1   1992     30–34       .    No                   . |
  3. | 8000468015      1   1988     25–29   Black    No     male       male |
  4. | 8006167153      1   1991     25–29   Black    No     male       male |
  5. | 8006142590      1   1991     20–24   Black    No   female     female |
     +----------------------------------------------------------------------+


Contains data from https://www.stata-press.com/data/r17/hbp2.dta
 Observations:         1,130                  
    Variables:             8                  3 Mar 2020 06:47
--------------------------------------------------------------------------------
Variable      Storage   Display    Val
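
To check which number `encode` assigned to each category, you can list the value label and tabulate the variable without labels. This is a minimal sketch on made-up data. The `smoker` variable and its values are hypothetical, echoing the Yes/No survey example above.

```stata
* Made-up survey answers (hypothetical data, for illustration only)
clear
input str3 smoker
"Yes"
"No"
"Yes"
""
end

* Create a numeric version; categories are numbered in alphabetical order
encode smoker, generate(smoker_num)

* Show the number-to-text mapping (by default the value label is named after the new variable)
label list smoker_num

* Show the underlying numbers; empty strings become missing (.)
tabulate smoker_num, nolabel missing
```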

### DESTRING
* destring -- converts a variable from a string to a numeric variable. This only works if the string variable ONLY has numbers (or if you tell Stata which other characters to ignore). Sometimes when you import data into Stata, a column of numbers is read in as a string.

In [10]:
*Reading in data and asking for description of data
use http://www.stata-press.com/data/r13/destring2, clear
desc
list in 1/5




Contains data from http://www.stata-press.com/data/r13/destring2.dta
 Observations:            10                  
    Variables:             3                  3 Mar 2013 22:50
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
date            str14   %10s                  
price           str11   %11s                  
percent         str3    %9s                   
--------------------------------------------------------------------------------
Sorted by: 


     +------------------------------------+
     |       date         price   percent |
     |------------------------------------|
  1. | 1999 12 10     $2,343.68       34% |
  2. | 2000 07 08     $7,233.44       86% |
  3. | 1997 03 02    $12,442.89       12% |
  4. | 1999 09 00   $233,325.31        6% |
  5. | 199

In [12]:
/*Code for destring
You could use either one of these codes. Note that the second one REPLACES the string variable with the numeric variable.*/
destring date price percent, generate(date2 price2 percent2) ignore("$ ,%")

*destring date price percent, ignore("$ ,%") replace

date: character space removed; date2 generated as long
price: characters $ , removed; price2 generated as double
percent: character % removed; percent2 generated as byte


In [13]:
*Let's make sure it worked.
desc
list in 1/5



Contains data from http://www.stata-press.com/data/r13/destring2.dta
 Observations:            10                  
    Variables:             6                  3 Mar 2013 22:50
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
date            str14   %10s                  
date2           long    %10.0g                
price           str11   %11s                  
price2          double  %10.0g                
percent         str3    %9s                   
percent2        byte    %10.0g                
--------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.


     +----------------------------------------------------------------------+
     |       date      date2         price      pri

<img src="https://pic.onlinewebfonts.com/svg/img_347619.png" width=300 height=300 />

## Practice time
You want to examine the racial and ethnic breakdown of incarceration. You find this Excel file of data for 2010. Unfortunately, when you read the file into Stata, it is all messy. Your job is to clean it so we can use the data.

In [1]:
*First, you must read in the file from the web.
import excel "https://www.prisonpolicy.org/data/race_ethnicity_gender_2010.xlsx", ///
    sheet(Total) clear

(35 vars, 61 obs)


In [2]:
*Second, you drop the observations or rows that are not necessary.
drop in 59/61
drop in 1/4


(3 observations deleted)

(4 observations deleted)


In [3]:
*Third, you only want to keep the information about racial and ethnic identity. So you drop the rest of the variables.
keep A B C D E F G H I J K L

In [4]:
*Fourth, you need to rename the variables with useful names
rename A geoid
rename B geoid2
rename C state
rename D tot_incar
rename E wht_incar
rename F blk_incar
rename G indig_incar
rename H asian_incar
rename I hawpi_incar
rename J other_incar
rename K multirace_incar
rename L lat_incar

In [5]:
*Row 1 just has the original column names, so you can drop it now that the variables are renamed
drop in 1

(1 observation deleted)


## Using the encode and destring commands, do the following:
* Q: Create a numeric version of the state variable.
* Q: Convert all the incarceration population variables into numeric variables.

In [6]:
*This is one way to transform the string variables into numeric variables.
destring tot_incar, replace
destring wht_incar, replace
destring blk_incar, replace
destring indig_incar, replace
destring asian_incar, replace
destring hawpi_incar, replace
destring other_incar, replace
destring multirace_incar, replace
destring lat_incar, replace
*You will know it worked if, in data view, the variables are displayed in black text.


tot_incar: all characters numeric; replaced as long

wht_incar: all characters numeric; replaced as long

blk_incar: all characters numeric; replaced as long

indig_incar: all characters numeric; replaced as long

asian_incar: all characters numeric; replaced as int

hawpi_incar: all characters numeric; replaced as int

other_incar: all characters numeric; replaced as long

multirace_incar: all characters numeric; replaced as int

lat_incar: all characters numeric; replaced as long
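
Because `destring` accepts a list of variables, the same conversion can be written in one line. This is a sketch on a tiny made-up dataset. The values are hypothetical, and only three of the variable names are reused from the exercise.

```stata
* Made-up string versions of three count variables (hypothetical values)
clear
input str6 tot_incar str6 wht_incar str6 blk_incar
"1000" "600" "300"
"2500" "900" "1200"
end

* The wildcard *_incar matches every variable whose name ends in _incar
destring *_incar, replace

* Confirm the storage types are now numeric
describe
```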


In [7]:
*Next, this is one way to make the string variable of "state" into a numeric variable
encode state, gen(state_num) label("State (numeric variable)")
*You will know it worked if state_num is displayed in blue text in data view.

In [8]:
*Let's make sure it worked
desc


Contains data
 Observations:            53                  
    Variables:            13                  
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
geoid           str63   %63s                  
geoid2          str6    %9s                   
state           str20   %20s                  
tot_incar       long    %10.0g                
wht_incar       long    %10.0g                
blk_incar       long    %10.0g                
indig_incar     long    %10.0g                
asian_incar     int     %10.0g                
hawpi_incar     int     %10.0g                
other_incar     long    %10.0g                
multirace_incar int     %10.0g                
lat_incar       long    %10.0g                
state_num       long    %20.0g     State (numeric variable)
   

* Q: Make a % white people incarceration variable, % Black people incarceration variable, % Indigenous people incarceration variable, and % Latinx people incarceration variable.
* Q: Print summary statistics of the % white people incarceration variable, % Black people incarceration variable, % Indigenous people incarceration variable, and % Latinx people incarceration variable.

In [9]:
gen wht_per=(100*wht_incar)/tot_incar
gen blk_per=(100*blk_incar)/tot_incar
gen indig_per=(100*indig_incar)/tot_incar
gen lat_per=(100*lat_incar)/tot_incar

summ wht_per blk_per indig_per lat_per







    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     wht_per |         53    55.42296    17.80663   5.669816   89.83419
     blk_per |         53    32.74913    20.98107   2.716373   87.43746
   indig_per |         53    4.141851    7.938771   .0929224   37.54161
     lat_per |         53     15.2114     16.0398    1.87551   97.62694
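
After creating percent variables, it can help to check that every value falls between 0 and 100. Here is a sketch of that check on made-up counts. The numbers are hypothetical, and only the variable names mirror the exercise.

```stata
* Made-up counts (hypothetical values, for illustration only)
clear
input tot_incar wht_incar
1000 400
2500 1500
end

* Same formula as in the exercise: group count as a percent of the total
generate wht_per = (100 * wht_incar) / tot_incar

* assert stops with an error if any nonmissing percent is outside 0 to 100
assert inrange(wht_per, 0, 100) if !missing(wht_per)
list
```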


In [11]:
*Let's make a box plot showing percent of incarcerated population across racial and ethnic identities.
graph box wht_per blk_per indig_per lat_per
graph export "incarrace_boxplot.png", replace width(3400)



file C:/Users/acade/.stata_kernel_cache/graph0.svg saved as SVG format
file C:/Users/acade/.stata_kernel_cache/graph0.pdf saved as PDF format

(file incarrace_boxplot.png not found)
file incarrace_boxplot.png saved as PNG format


![incarrace_boxplot.png](attachment:incarrace_boxplot.png)

In [23]:
*Let's save our data
save "incarceration2010 1 27 22.dta", replace

file incarceration2010 1 27 22.dta saved


<img src="http://cdn.onlinewebfonts.com/svg/img_431471.png" width=300 height=300 />

### Ways to examine variables (aka univariate statistics)
I want to discuss a couple of ways you can examine one variable at a time.

In [13]:
*First, we need to import .dta file from the web
use "https://www.stata-press.com/data/r17/lbw.dta", clear
desc


(Hosmer & Lemeshow data)


Contains data from https://www.stata-press.com/data/r17/lbw.dta
 Observations:           189                  Hosmer & Lemeshow data
    Variables:            11                  15 Jan 2020 05:01
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
id              int     %8.0g                 Identification code
low             byte    %8.0g                 Birthweight<2500g
age             byte    %8.0g                 Age of mother
lwt             int     %8.0g                 Weight at last menstrual period
race            byte    %8.0g      race       Race
smoke           byte    %9.0g      smoke      Smoked during pregnancy
ptl             byte    %8.0g                 Premature labor history (count)
ht              byte    %8.0g               

<b>Q: What does this tell us? Can anyone say anything about this data from this output?</b>

This dataset has 189 observations and 11 variables. Based on the variable labels, we can see this is health data.

<b>Q: What variables are categorical and what variables are numeric? Explain. </b>

Numeric (in the traditional sense): age, lwt, bwt, ptl, and ftv

Categorical (in traditional sense): race, smoke (both variables are displayed in blue meaning they are categorical variables with numeric values).

low, ht, and ui seem to be categorical variables but are coded as numeric. For example:
* low is coded 1 if bwt<2500
* ht is coded 1 for yes
* ui is coded 1 for yes

## Frequency tables are a great way to examine categorical variables.

In [25]:
*Let's find out what is the breakdown of smokers in the respondents.
tab smoke


     Smoked |
     during |
  pregnancy |      Freq.     Percent        Cum.
------------+-----------------------------------
  Nonsmoker |        115       60.85       60.85
     Smoker |         74       39.15      100.00
------------+-----------------------------------
      Total |        189      100.00


<b>Q: What does this tell us?</b>

This tells us that 39.15% of the sample are smokers.

## We cannot take the mean or standard deviation of categorical variables. We can examine summary statistics of a numeric variable across the categories of a categorical variable.

In [26]:
*We can examine statistics across another variable. (This is more bivariate statistics)
tabstat bwt, by(smoke) stat(mean sd min median max)


Summary for variables: bwt
Group variable: smoke (Smoked during pregnancy)

    smoke |      Mean        SD       Min       p50       Max
----------+--------------------------------------------------
Nonsmoker |  3054.957   752.409      1021      3100      4990
   Smoker |  2772.297  659.8075       709    2775.5      4238
----------+--------------------------------------------------
    Total |  2944.286   729.016       709      2977      4990
-------------------------------------------------------------


<b> Q: What do we find? </b>

We see that smokers have a lower average birthweight than non-smokers (about 2,772 grams compared to about 3,055 grams).

## Sometimes, we can convert numeric variable into a categorical variable. This is sometimes called categorization of numeric variables.

For example, in this dataset, low is a categorical variable made from the birthweight of the child (bwt).
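
A variable like low can be rebuilt from bwt with a logical expression. This is a sketch using the same lbw dataset as above. The name `low_check` is hypothetical, and the 2,500 gram cutoff comes from the variable label of low.

```stata
* Same dataset used in this section
use "https://www.stata-press.com/data/r17/lbw.dta", clear

* A logical expression returns 1 when true and 0 when false;
* the if condition keeps missing birthweights missing
generate byte low_check = bwt < 2500 if !missing(bwt)

* Compare the rebuilt variable with the one shipped in the data
tabulate low low_check
```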

One way I have done this in the past is to make categories based on the quantiles or terciles of a variable.

<div class="alert alert-block alert-warning">
A percentile is a ranking of where a value falls in the sample. So when we say something is in the 17th percentile, it means about 17% of the sample falls below that value. </div>
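
To make the percentile idea concrete, here is a sketch that finds the 17th percentile of bwt in the lbw dataset and then counts how many observations fall below it. The scalar name `p17` is hypothetical.

```stata
* Same dataset used in this section
use "https://www.stata-press.com/data/r17/lbw.dta", clear

* _pctile stores the requested percentile in r(r1)
_pctile bwt, percentiles(17)
scalar p17 = r(r1)
display "17th percentile of bwt: " p17

* Count the observations below that value and express them as a share of the sample
count if bwt < p17
display "Percent of sample below: " 100 * r(N) / _N
```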

In [27]:
*Stata has a command that makes a variable based on percentiles. 
xtile bwt_ter = bwt, nq(3)

In [28]:
*We can check the data here
tab bwt_ter


3 quantiles |
     of bwt |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         63       33.33       33.33
          2 |         63       33.33       66.67
          3 |         63       33.33      100.00
------------+-----------------------------------
      Total |        189      100.00


Note that the sample is equally divided into the three categories because it is a ranking system. 

Here, observations in category 1 report the lowest bwt and observations in category 3 report the highest bwt as compared to the rest of the sample.
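
Value labels and a quick range check make the new categories easier to read. This is a sketch on the same lbw dataset. The label name `terlbl` and the label text are hypothetical choices.

```stata
* Same dataset and same xtile command as above
use "https://www.stata-press.com/data/r17/lbw.dta", clear
xtile bwt_ter = bwt, nq(3)

* Attach readable labels to the three categories
label define terlbl 1 "Lowest third" 2 "Middle third" 3 "Highest third"
label values bwt_ter terlbl

* Show the birthweight range and number of observations in each category
tabstat bwt, by(bwt_ter) stat(min max n)
```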

In [33]:
*Let's look at the breakdown of racial/ethnic identity of respondents
graph pie, over(race) ///
    plabel(_all name, color(white)) ///
    title("Racial identity of respondents") ///
    legend(off)
graph export "race_piechart.png", replace width(3400)



file C:/Users/acade/.stata_kernel_cache/graph3.svg saved as SVG format
file C:/Users/acade/.stata_kernel_cache/graph3.pdf saved as PDF format

file race_piechart.png saved as PNG format


![race_piechart.png](attachment:race_piechart.png)

In [34]:
*We can examine the distribution of the variable by using a box plot
graph box bwt, ///
    title("Box plot for birthweight of child")
graph export "bwt_boxplot.png", replace width(3400)



file C:/Users/acade/.stata_kernel_cache/graph4.svg saved as SVG format
file C:/Users/acade/.stata_kernel_cache/graph4.pdf saved as PDF format

(file bwt_boxplot.png not found)
file bwt_boxplot.png saved as PNG format


![bwt_boxplot.png](attachment:bwt_boxplot.png)

In [35]:
summ bwt


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         bwt |        189    2944.286     729.016        709       4990


<b>Q: What do we see? What is the distribution of the variable?</b>

The mean of the variable is around 3,000 grams. About 50% of the sample is between 2,500 and 3,500 grams. It has an outlier, but seems mostly normally distributed.
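
One way to check these statements with numbers is the `detail` option of `summarize`, which reports the quartiles and skewness. The sketch below uses the same lbw dataset and overlays a normal curve on the histogram for a visual comparison.

```stata
* Same dataset used in this section
use "https://www.stata-press.com/data/r17/lbw.dta", clear

* detail adds percentiles, skewness, and kurtosis to the usual summary
summarize bwt, detail
display "Middle 50% runs from " r(p25) " to " r(p75) " grams"
display "Skewness (0 for a symmetric distribution): " r(skewness)

* Histogram with a normal density curve drawn on top
histogram bwt, normal
```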

In [39]:
*How does the distribution vary across race? This is more bivariate, but I want to show you.
graph box bwt, over(race) ///
    title("Box plot for birthweight of child" "across racial/ethnic identity")
graph export "bwt_race_boxplot.png", replace width(3400)



file C:/Users/acade/.stata_kernel_cache/graph8.svg saved as SVG format
file C:/Users/acade/.stata_kernel_cache/graph8.pdf saved as PDF format

file bwt_race_boxplot.png saved as PNG format


![bwt_race_boxplot-2.png](attachment:bwt_race_boxplot-2.png)

In [42]:
*Descriptive statistics of age
summ age


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        189     23.2381    5.298678         14         45


In [45]:
*We can plot a histogram of age
hist age, frequency
graph export "age_hist.png", replace width(3400)


(bin=13, start=14, width=2.3846154)

file C:/Users/acade/.stata_kernel_cache/graph12.svg saved as SVG format
file C:/Users/acade/.stata_kernel_cache/graph12.pdf saved as PDF format

file age_hist.png saved as PNG format


![age_hist.png](attachment:age_hist.png)

# Modeling and regression
In social statistics, we are using variables to understand social issues or social patterns. We are using modeling to understand these important topics. The following things are important when modeling:
* Dependent variable - the variable we are interested in explaining
* Independent variable - the predictor variable, the variable we want to show is important in explaining the dependent variable.
* Shape of the relationship - linear, curvilinear, exponential, etc...
* Direction of the relationship - positive, negative, zero
* Strength of the relationship - for example, whether it is statistically significant

We model relationships in society to explain/evaluate social theories. We use hypothesis testing and regression to examine those models or relationships.

### Simple Linear Regression
<center>$Y_{i} = \beta_{0} + \beta_{1}(x_{i}) + e_{0i}$</center>
    
where

$Y$ is dependent/outcome variable, the variable you are interested in explaining

$i$ is the observation

$\beta_{0}$ is the intercept (or constant), the predicted value of $Y$ when $x$ is 0

$\beta_{1}$ is the slope, the coefficient for the independent variable or predictor variable

$x_{i}$ is the value of the independent variable for observation $i$
    
$e_{0i}$ is the error of the prediction or residual between the actual value and predicted value

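One way to estimate a linear regression in social statistics is through ordinary least squares (OLS). OLS chooses the values of $\beta_{0}$ and $\beta_{1}$ that minimize the sum of squared differences between the observed values and the values predicted by the line.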
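
For a single predictor, the OLS slope equals the covariance of $x$ and $Y$ divided by the variance of $x$, and the intercept equals the mean of $Y$ minus the slope times the mean of $x$. The sketch below computes both by hand on the lbw dataset and compares them with `regress`. The scalar names are hypothetical.

```stata
* Same dataset used in this section
use "https://www.stata-press.com/data/r17/lbw.dta", clear

* Covariance of bwt (variable 1) and age (variable 2), and the variance of age
quietly correlate bwt age, covariance
scalar slope = r(cov_12) / r(Var_2)

* Intercept = mean of the outcome minus slope times mean of the predictor
quietly summarize bwt
scalar ybar = r(mean)
quietly summarize age
scalar xbar = r(mean)
scalar intercept = ybar - slope * xbar

display "Slope by hand: " slope "   Intercept by hand: " intercept

* The coefficients reported by regress should match the hand calculation
regress bwt age
```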

In [14]:
*For example, we can tell Stata to do a simple OLS. First let's plot the values between age and birthweight.
scatter bwt age
graph export "bwt_age_scatter.png", replace width(3400)



file C:/Users/acade/.stata_kernel_cache/graph1.svg saved as SVG format
file C:/Users/acade/.stata_kernel_cache/graph1.pdf saved as PDF format

file bwt_age_scatter.png saved as PNG format


![bwt_age_scatter.png](attachment:bwt_age_scatter.png)

In [15]:
*This is basic OLS command
regress bwt age
*This makes a yhat variable where yhat = 2658.12 + 12.31 * (age of observation)
predict yhat



      Source |       SS           df       MS      Number of obs   =       189
-------------+----------------------------------   F(1, 187)       =      1.51
       Model |  800428.169         1  800428.169   Prob > F        =    0.2207
    Residual |  99114870.4       187  530026.045   R-squared       =    0.0080
-------------+----------------------------------   Adj R-squared   =    0.0027
       Total |  99915298.6       188  531464.354   Root MSE        =    728.03

------------------------------------------------------------------------------
         bwt | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   12.31444   10.02079     1.23   0.221     -7.45389    32.08277
       _cons |   2658.122   238.8097    11.13   0.000     2187.014    3129.229
------------------------------------------------------------------------------

(option xb assumed; fitted values)


<center>$(birthweight)_{i} = \beta_{0} + \beta_{1}(age_{i})$</center>

<center>$(birthweight)_{i} = 2658.12 + 12.31(age_{i})$</center>

$\beta_{0}$ is the intercept. It represents the predicted birthweight when age is 0 (about 2,658 grams).

$\beta_{1}$ is the slope between age and birthweight. Here it is a positive slope, meaning each additional year of age is associated with an increase of about 12.31 grams in birthweight. However, $\beta_{1}$ is not statistically significant at the .05 level (p = 0.221).
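
The stored coefficients can be used directly to compute a prediction and to reproduce the p-value. This is a sketch on the same lbw dataset. Age 25 is an arbitrary example value.

```stata
* Same dataset and model as above
use "https://www.stata-press.com/data/r17/lbw.dta", clear
regress bwt age

* Predicted birthweight for a 25-year-old mother: intercept + slope * 25
display "Predicted birthweight at age 25: " _b[_cons] + _b[age] * 25

* Two-sided p-value for the slope, from its t statistic and the residual degrees of freedom
display "p-value for age: " 2 * ttail(e(df_r), abs(_b[age] / _se[age]))
```

A p-value above .05 means we cannot rule out that the true slope is zero.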

In [16]:
*Let's show the observations and the OLS regression line
twoway (scatter bwt age) ///
    (line yhat age)
graph export "bwt_age_toway.png", replace width(3400)



file C:/Users/acade/.stata_kernel_cache/graph2.svg saved as SVG format
file C:/Users/acade/.stata_kernel_cache/graph2.pdf saved as PDF format

file bwt_age_toway.png saved as PNG format


![bwt_age_toway.png](attachment:bwt_age_toway.png)

We can have other types of relationships within our regression analyses. One common type is a curvilinear relationship. The way to include these relationships in your regression is simply to transform the variable.

For example, in environmental sociology it is common to add a quadratic term of GDP per capita.


<center>$(UrbanPercent)_{i} = \beta_{0} + \beta_{1}(\text{GDP per capita}_{i}) + \beta_{2}(\text{GDP per capita}^{2}_{i})$</center>

![Fig_scatter.png](attachment:Fig_scatter.png)

![Fig_linearline.png](attachment:Fig_linearline.png)

![Fig_quad.png](attachment:Fig_quad.png)
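
The GDP data behind these figures is not included in this workbook, so the sketch below shows the same quadratic setup on the lbw dataset instead, with age as the predictor. This is an assumption made purely for illustration. The name `age_sq` is hypothetical.

```stata
* Same dataset used earlier in this workbook
use "https://www.stata-press.com/data/r17/lbw.dta", clear

* Option 1: create the squared term by hand and add it to the model
generate age_sq = age^2
regress bwt age age_sq

* Option 2: factor-variable notation builds the squared term for you
regress bwt c.age##c.age
```

Both options estimate the same model. The second is convenient because postestimation commands such as `margins` then know that the two terms come from the same variable.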