# Workbook #1: Data carpentry
Welcome to the first workbook of SOCG 206. This workbook will cover basic tasks within data carpentry. These are tasks I use a lot within my own work. The <b>learning outcomes</b> for this workbook:
* Students will be able to identify essential Stata commands and features of Stata.
* Students will be able to clean basic excel/csv files using Stata.
* Students will be able to define variable and differential between types of variables in theory and coding.
* Student will be able to convert variables between datatypes in Stata.

## Getting our workstations set-up
I teach SOCG 206 using **Stata**. This is the program I use to do my published sociological work. 

### UCSD a campus license for **Stata** for students
Please get access to [**Stata**](https://blink.ucsd.edu/technology/computers/software-acms/available-software/stata.html).

### Jupyter Notebook, Anaconda, and GitHub
When teaching any sort of quantitative or coding, I use **Jupyter Notebooks** that I publish on *GitHub*. This is a great way to share notes and code. I recommend using these tools.

*Jupyter Notebook* (.ipynb) is the file you are currently seeing now. These notebooks can save notes, codes, and output all in one place. They are so helpful in teaching and project note-taking.

I use *Jupyter Notebook* through *Anaconda*. *Anaconda* is a **Python** platform and uses includes *Jupyter* Notebook. Download [*Anaconda*](https://www.anaconda.com/docs/getting-started/anaconda/install). Usually, *Anaconda* comes with *Jupyter Notebook* but here is information about downloading [*Juypter*](https://jupyter.org/install).

*Jupyter Notebook* uses **Python** by default. However, you can download [**Stata**](https://kylebarron.dev/stata_kernel/) and [**R**](https://github.com/IRkernel/IRkernel) kernels. You must have **Stata** and **R** already on your computer to use the kernels.

[*GitHub*](https://github.com/) is a site where a lot of us publish teaching and research codes. I recommend making an account.

<b>Q: Have you downloaded **Stata**? Are you interested in using *Jupyter Notebook* and *GitHub*?</b>

 <img src="http://cdn.onlinewebfonts.com/svg/img_454217.png" width=100 height=100 />

## Organization tips
<i>"Tidy a little a day and you'll be tidying forever"</i> -Marie Kondo

It is <u>so</u> important to keep your quantitative work organized. These are some tips that have helped me stay organized.

* Make class/project/homework specific folders. Use basic universal commands to redirect your work to that specific folder.
* Label files by date "homework1 1 31 25"
* I highly recommend using code when possible. Codes are a written record of what you did. I almost always have to go back and see what I did or re-run analyses. In the beginning I would have to spend a lot of time re-making code because I did not save a code file. Now, I consistently make a do-file or a Jupyter Notebook and these codes save me!
* Write notes in your code file.

<b>Q: What is your current organization strategy? How does it work to steamline your workflow? What difficulties do you face in your current organization strategy? </b>

<div class="alert alert-block alert-warning">Stata is a powerful program and has a lot features to it. If you are new to Stata, please refer for 
- Longest, K. C. (2020). Using stata® for quantitative analysis - third edition. SAGE Publications, Inc. https://www.doi.org/10.4135/9781544318547 (it is avaliable as an e-book at the library.) </div>

 <img src="https://cdn1.iconfinder.com/data/icons/development-2-webby/60/52_-Script-_development_coding_programming_code-512.png" width=200 height=200 /> 

## Basic universal commands

The following commands are universal in many coding languages including **Stata**, **Python**, and **R**. More importantly, these commands help us stay organized within the memory space of the computer.

```pwd```

This stands for "print working directory". It tells you where in your computer you are working on.

In [1]:
pwd

D:\documents copy\teaching\SOCG 206 spring 2025\jupyter


```cd file_path```

This means "change directory." You can change your current working directory with cd. <b>Use parenthesis for file paths!</b>

In [2]:
cd "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1"

D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1


I recommend always starting your coding sessions with changing your working directory to the folder your project file. See how this folder is placed in the hierarchy of my work?

## Essential **Stata** commands

The following commands are essential commands for Stata.

### OPENING files
This commands opens .dta files. It is important to include the ```, clear``` at the end because this clear out any current dataset open in **Stata**. **Stata** can only handle one dataset open at a time. By habit, I usually put parenthesis around file names and website links. Parenthesis tell the computer that this is a string.

```use file_path_name, clear```

In [1]:
*example of opening .dta file from the web. This also will work for a file path.
use "https://www.stata-press.com/data/r18/lifeexp.dta", clear

(Life expectancy, 1998)


### OPENING NON-STATA FILES
Sometimes we need to open excel or csv (comma-separated vector) files. Usually to open these files, you use a different set of commands.

#### Opening excel files
Firstrow means the first row is the variable names. You can specify the sheet name or omit it if there is only one sheet. 

````import excel file_name, firstrow sheet(name) clear````

### Opening csv files
````import delimited file_name, clear````

In [12]:
*First, you must read in the file from the web.
import excel "https://ers.usda.gov/sites/default/files/_laserfiche/DataFiles/53251/Ruralurbancontinuumcodes2023.xlsx?v=10526", firstrow clear
describe
list in 1/3




Contains data
  obs:         3,235                          
 vars:             6                          
 size:       436,725                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
FIPS            str5    %9s                   FIPS
State           str2    %9s                   State
County_Name     str46   %46s                  County_Name
Population_2020 long    %10.0gc               Population_2020
RUCC_2023       byte    %10.0g                RUCC_2023
Description     str77   %77s                  Description
--------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.


     +-------------------------------------------------------------------------+
  1. |   FIPS   

In [13]:
*First, you must read in the file from the web.
import delimited "https://ers.usda.gov/sites/default/files/_laserfiche/DataFiles/53251/Ruralurbancontinuumcodes2023.csv?v=92956", clear
describe
list in 1/3


(5 vars, 9,703 obs)


Contains data
  obs:         9,703                          
 vars:             5                          
 size:     1,397,232                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
fips            long    %8.0g                 FIPS
state           str2    %9s                   State
county_name     str46   %46s                  County_Name
attribute       str15   %15s                  Attribute
value           str77   %77s                  Value
--------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.


     +--------------------------------------------------------------------+
  1. |   fips   |   state    |      county_name    |         attribut

### SAVING 
This commands saves .dta files. Important to remember to include ```, replace``` otherwise Stata wont allow it to be save over.

```save file_path_name, replace```

In [6]:
save "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.dta", replace

(note: file D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.dta not
>  found)
file D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.dta saved


### Saving files as excel or csv
The command above saves files as .dta. However, you can save in other data formats.

<b>Excel</b>

```export excel file_name, replace```

<b>CSV</b>

```export delimited file_name,replace```

In [21]:
export excel "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.xlsx", replace

file D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.xlsx saved


In [22]:
export delimited "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.csv", replace

(note: file D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.csv not
>  found)
file D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data.csv saved


Go check your folders to make sure it saved.

### COMMENTING
Use asterisk * of double backslash // to write single-line comments. Use /* */ to write multi-line comments.

In [5]:
*Use the asterisk sign * or // to write single-line commands.
/* Use asterisk with a backslash and asterisk to 
multi-line comments */ 

### HELP
Type ```help``` in front of the command for more information about the command. I always use the help command because I forget coding syntax. I look at the examples or the Stata manual for help. If that doesn't help then I will copy and paste commands to Google

In [7]:
help merge

## Types of Stata files

Successful quantitive projects rely on these files...
- <b>.do files</b> - these files store commands and comments.
- <b>.dta files</b> - these files store data.
- <b>.smcl files</b> - these files results or the output. I usually like to conver these into a pdf (I'll show you how to do this later)

In [2]:
*setting up working directory to be my workbook 1 folder
cd "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1"

*open a log file
log using "week2 inclass practice.smcl", replace


D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1

--------------------------------------------------------------------------------
      name:  <unnamed>
       log:  D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1\w
> eek2 inclass practice.smcl
  log type:  smcl
 opened on:  10 Apr 2025, 10:19:41


<b>What would have happened if we used the <i>log using "week1 inclass practice.smcl", replace</i> code and DID NOT change directory? Where would the file be save at?</b>

In [3]:
*opening datafile
use "https://www.stata-press.com/data/r18/lifeexp.dta", clear

*tell stata to describe data
desc


(Life expectancy, 1998)


Contains data from https://www.stata-press.com/data/r18/lifeexp.dta
  obs:            68                          Life expectancy, 1998
 vars:             6                          26 Mar 2022 09:40
 size:         2,652                          (_dta has notes)
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
region          byte    %16.0g     region     Region
country         str28   %28s                  Country
popgrowth       float   %9.0g               * Avg. annual % growth
lexp            byte    %9.0g               * Life expectancy at birth
gnppc           float   %9.0g               * GNP per capita
safewater       byte    %9.0g               * Safe water
                                            * indicated variables have notes
------

All of the commands and output will be saved into the log. Log files are files, but they can only be open through **Stata**. I prefer to save my log files as pdf because I can read and share them easily.

```capture log close```

```translate log_file_name pdf_file_name```

In [4]:
*This will close and save log file and convert to pdf
capture log close
translate "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1\week2 inclass practice.smcl" ///
    "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1\week2 inclass practice.pdf", replace



(file D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\workbook1\week2 in
> class practice.pdf written in PDF format)


In [2]:
*type edit and dataview file will open
edit

<img src="https://healthitanalytics.com/images/site/features/_normal/ThinkstockPhotos-645261596.jpg" width=300 height=300 />

## Variables in social statistics
When you hear the word variable--what comes to mind?

In social statistics, we generally have two kinds of variables:
* Categorical variables are variables similar to discrete variables, where the values are categories/boundaries. For example, race or state (Alabama, Oregon, etc...). 
* Numeric variables are variables similar to contiunous variables, where the values have numeric value to them. For example, percent of Latinx residents or birthweight of child. 

Categorical variables need to handle with more care as compared to numeric variables. For example, you cannot find the mean. You can find the mode of categorical variables. 

You can do a lot of statistics with numeric variables because you can treat them like numbers. You can find the mean and standard deviation. Just keep in mind and always ask yourself...what does it mean?

## Variables in coding and **Stata** code
It is important to understand how variables are handle in coding and **Stata** code. Here we will cover variable types and storage types.

<img src="https://static.semrush.com/blog/uploads/media/cd/34/cd34e2cb04a60d0d027c033e64591477/types-of-content-marketing.svg" width=300 height=300 />

### Variable types
Each coding language has it own datatypes. Generally, there is are numeric (have actual number meaning) and string (have meaning in characters and texts). Keep in mind that each coding language has more specific datatypes. Let's review the **Stata** datatypes:
* Numeric variables -- These variables are numbers (similar to previous definition). You can do calculations with numeric variables like mean or standard deviation. In data view, numeric variables are displayed in black text.
* String variables -- These variables are characters or text. String can be in double quotes. In data view, string variables are displayed in red text.
* Numeric variables with string labels -- These are special Stata-only variable that are denoted in blue in the data viewer. They are saved as numeric variable and have string label attached to them. You can manually add labels or get this by using the encode command (discussed below).

<img src="https://ophtek.com/wp-content/uploads/2018/04/data-storage.jpg" width=300 height=300 />

### Storage types
Data takes space or storage on your computer. Each coding language has different space formats. **Stata**'s data format is:

| Storage type | Min | Max | Closest to 0 without being 0 | Bytes |
| --- | --- | --- | --- | --- |
| byte | -127 | 100 | +/-1 | 1 |
| int | -32,767 | 32,740 | +/-1 | 2 |
| long | -2,147,483,647 | 2,147,483,620 | +/-1 | 3 |
| float | -1.70141173319 x 10^38 | 1.70141173319 x 10^38 | +/-10^-38 | 4 |
| double | -8.9884656743 x 10^307 | +8.9884656743 x 10^37 | +/-10^-323 | 5 |

<i>Don't confuse integer in the numeric sense with the "integer" storage type in **Stata**. **Stata** also recognizes time.

String variables are stored as str1, str2, ..., str2045, and strL. Where the number after "str" indicates the length of the string variable.

<B>IT IS SO IMPORTANT TO KNOW YOUR VARIABLE TYPES AND STORAGE TYPES. SOME CODES ONLY WORK FOR SPECIFIC DATA AND STORAGE TYPES.

In [1]:
use "https://www.stata-press.com/data/r17/census12.dta", clear

*the describe command will give you summary of the data type of your variables
describe


(1980 Census data by state)


Contains data from https://www.stata-press.com/data/r17/census12.dta
  obs:            50                          1980 Census data by state
 vars:             7                          6 Apr 2020 15:43
 size:         1,950                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
state           str14   %14s                  State
state2          str2    %-2s                  Two-letter state abbreviation
region          str7    %9s                   Census region
pop             long    %10.0g                Population
median_age      float   %9.2f                 Median age
marriage_rate   float   %9.0g                 
divorce_rate    float   %9.0g                 
----------------------------------------------------------

In [2]:
list in 1/3


     +-----------------------------------------------------------------+
  1. |         state | state2 | region |     pop | median~e | marria~e |
     | Massachusetts | MA     |     NE | 5737037 |    31.20 | .0080657 |
     |-----------------------------------------------------------------|
     |                            divorc~e                             |
     |                            .0031154                             |
     +-----------------------------------------------------------------+

     +-----------------------------------------------------------------+
  2. |         state | state2 | region |     pop | median~e | marria~e |
     |  Rhode Island | RI     |     NE |  947154 |    31.80 | .0079079 |
     |-----------------------------------------------------------------|
     |                            divorc~e                             |
     |                            .0038072                             |
     +-------------------------------------------

## Switching between variables types
Sometimes the data we get is messy and we have clean it before we can even calculate mean.

### ENCODE
```encode``` makes a string variable into a numeric. For example, let's say we have survey data with a question of "Are you a smoker" Yes/No. ```encode``` will create a new variable where "Yes" gets a value for it and "No" gets a value for it. When a variable has been "encoded" it is displayed in blue text.

```encode string_var, gen(new_var_name) label(label_name)```

In [4]:
*This opens the data
use https://www.stata-press.com/data/r17/hbp2.dta, clear
*This code prints a descriptive of the dataset.
desc




Contains data from https://www.stata-press.com/data/r17/hbp2.dta
  obs:         1,130                          
 vars:             7                          3 Mar 2020 06:47
 size:        24,860                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
id              str10   %10s                  Record identification number
city            byte    %8.0g                 City
year            int     %8.0g                 Year
age_grp         byte    %8.0g      agefmt     Age group
race            byte    %8.0g      racefmt    Race
hbp             byte    %8.0g      yn         High blood pressure
sex             str6    %9s                   Sex
--------------------------------------------------------------------------------
Sorted by: 


In [5]:
list in 1/5


     +-----------------------------------------------------------+
     |         id   city   year   age_grp    race   hbp      sex |
     |-----------------------------------------------------------|
  1. | 8008238923      1   1993     15–19   Black    No   female |
  2. | 8007143470      1   1992     30–34       .    No          |
  3. | 8000468015      1   1988     25–29   Black    No     male |
  4. | 8006167153      1   1991     25–29   Black    No     male |
  5. | 8006142590      1   1991     20–24   Black    No   female |
     +-----------------------------------------------------------+


In [7]:
*Stata wont let you run statistics with string variables. See error message
regress year i.sex

sex:  string variables may not be used as factor variables


r(109);





<b>The variable <i>sex</i> is a string datatype.

In [8]:
*This is an example of using encode
encode sex, gen(sex_numeric) label("Respondent's sex (numeric)")

In [8]:
*Let's see if there was a change
list in 1/5
desc



     +----------------------------------------------------------------------+
     |         id   city   year   age_grp    race   hbp      sex   sex_nu~c |
     |----------------------------------------------------------------------|
  1. | 8008238923      1   1993     15–19   Black    No   female     female |
  2. | 8007143470      1   1992     30–34       .    No                   . |
  3. | 8000468015      1   1988     25–29   Black    No     male       male |
  4. | 8006167153      1   1991     25–29   Black    No     male       male |
  5. | 8006142590      1   1991     20–24   Black    No   female     female |
     +----------------------------------------------------------------------+


Contains data from https://www.stata-press.com/data/r17/hbp2.dta
 Observations:         1,130                  
    Variables:             8                  3 Mar 2020 06:47
--------------------------------------------------------------------------------
Variable      Storage   Display    Val

In [10]:
codebook sex_numeric


--------------------------------------------------------------------------------
sex_numeric                                                                  Sex
--------------------------------------------------------------------------------

                  Type: Numeric (long)
                 Label: Respondent's sex (numeric), but label does not exist

                 Range: [1,2]                         Units: 1
         Unique values: 2                         Missing .: 2/1,130

            Tabulation: Freq.  Value
                          433  1
                          695  2
                            2  .


In [9]:
*Now, we can run statistics
regress year i.sex_numeric


      Source |       SS           df       MS      Number of obs   =     1,128
-------------+----------------------------------   F(1, 1126)      =      1.50
       Model |   3.0599326         1   3.0599326   Prob > F        =    0.2207
    Residual |  2294.78315     1,126  2.03799569   R-squared       =    0.0013
-------------+----------------------------------   Adj R-squared   =    0.0004
       Total |  2297.84309     1,127  2.03890247   Root MSE        =    1.4276

------------------------------------------------------------------------------
        year |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 sex_numeric |
       male  |  -.1070962   .0874017    -1.23   0.221    -.2785847    .0643923
       _cons |   1991.196   .0686053  2.9e+04   0.000     1991.062    1991.331
------------------------------------------------------------------------------


In [11]:
codebook race


--------------------------------------------------------------------------------
race                                                                        Race
--------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: racefmt

                 Range: [1,3]                         Units: 1
         Unique values: 3                         Missing .: 4/1,130

            Tabulation: Freq.   Numeric  Label
                          196         1  White
                          773         2  Black
                          157         3  Hispanic
                            4         .  


<b> Now, the variable sex_numeric is numeric data type with string labels so **Stata** can run analyses. Race is similar because they are denoted by blue text and in the describe printout you can see their storage type is numeric.

### DESTRING
```destring``` converts a variable from a string to numeric variable. This only works if the string variable ONLY has numbers. Sometimes when you import data, **Stata** reads it as string because there is a non-numeric character in the variable. You have to observe the data and include removing that character from the data.

```destring var, gen(new_name)```

```destring var, replace```

```destring var, gen(new_name) ignore(character)```

In [10]:
*Reading in data and asking for description of data
use http://www.stata-press.com/data/r13/destring2, clear
desc
list in 1/5




Contains data from http://www.stata-press.com/data/r13/destring2.dta
  obs:            10                          
 vars:             3                          3 Mar 2013 22:50
 size:           280                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
date            str14   %10s                  
price           str11   %11s                  
percent         str3    %9s                   
--------------------------------------------------------------------------------
Sorted by: 


     +------------------------------------+
     |       date         price   percent |
     |------------------------------------|
  1. | 1999 12 10     $2,343.68       34% |
  2. | 2000 07 08     $7,233.44       86% |
  3. | 1997 03 02    $12,442.89       12% |
  4. | 

<b>All variable in this dataset are string</b>

In [14]:
destring price, generate(price3)

price: contains nonnumeric characters; no generate


In [15]:
/*Code for destring
You could use either one of these codes. Note that the second one REPLACES the string variable with the numeric variable.*/
destring date price percent, generate(date2 price2 percent2) ignore("$ ,%")

*destring date price percent, ignore("$ ,%") replace

date: character space removed; date2 generated as long
price: characters $ , removed; price2 generated as double
percent: character % removed; percent2 generated as byte


In [6]:
*Let's make sure it worked.
desc
list in 1/5



Contains data from http://www.stata-press.com/data/r13/destring2.dta
 Observations:            10                  
    Variables:             6                  3 Mar 2013 22:50
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
date            str14   %10s                  
date2           long    %10.0g                
price           str11   %11s                  
price2          double  %10.0g                
percent         str3    %9s                   
percent2        byte    %10.0g                
--------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.


     +----------------------------------------------------------------------+
     |       date      date2         price      pri

# Cleaning a real data
You want to examine racial and ethnic breakdown of incarceration and compare them to the general population. Go to the [Prison Policy Initative site](https://www.prisonpolicy.org/data/). We want to Incarcerated populations by race/ethnicity and gender for each state excel. You find this excel file of data for 2010. Unfortunately, when you read the file into **Stata**, it all messy. Your job is clean it so we can use the data. First, I open the excel and do a visual test. Let's make a list of what needs to be cleaned....
* rename variable name
* Row 5-58 has the data
* destring the numeric values

<b>This is where code becomes very handy--we have our code in case we mess up! Make sure to have a do-file.</b>

In [8]:
*First, you must read in the file from the web.
import excel "https://www.prisonpolicy.org/data/race_ethnicity_gender_2010.xlsx", ///
    sheet(Total) clear
describe




Contains data
  obs:            61                          
 vars:            35                          
 size:       130,540                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
A               str63   %63s                  
B               str6    %9s                   
C               str20   %20s                  
D               str50   %50s                  
E               str51   %51s                  
F               str71   %71s                  
G               str79   %79s                  
H               str51   %51s                  
I               str88   %88s                  
J               str61   %61s                  
K               str57   %57s                  
L               str58   %58s                  
M             

In [2]:
list A C E in 1/6 
list A C E in 58/61



     +---------------------------------------------------------------------+
  1. |                                                                 A   |
     |   Incarcerated population by race/ethnicity in Census 2010- total   |
     |---------------------------------------------------------------------|
     |             C |                                                   E |
     |               |                                                     |
     +---------------------------------------------------------------------+

     +---------------------------------------------------------------------+
  2. |                                                                 A   |
     |                                                                     |
     |---------------------------------------------------------------------|
     |             C |                                                   E |
     |               |                                                   

Which rows need to be omitted?

### DROP & KEEP
To remove rows or columns from dataset, we can use ```drop``` or ```keep```. You can also add conditional statements.

To drop rows use:

```drop in rownumber/rownumber```

```keep in rownumber/rownumber```

To drop columns use:  

```drop varname```

```keep varname```

To drop based on conditional statements:

```drop if condition```

```keep if condition```

In [9]:
*Second, you drop the observations or rows that are not necessary.
drop in 59/61
drop in 1/4


(3 observations deleted)

(4 observations deleted)


<b> instead of ```drop```, we could have used ```keep```<b>
    
```keep in 5/58```

In [50]:
list in 1/2


     +-------------------------------------------------------------------------+
  1. |               A       |            B       |                   C        |
     |           GEOID       |       GEOID2       |           Geography        |
     |-------------------------------------------------------------------------|
     |                                                          D              |
     |              Total : In Correctional Facilities for Adults              |
     |-------------------------------------------------------------------------|
     |                                                             E           |
     |           White alone : in Correctional Facilities for Adults           |
     |-------------------------------------------------------------------------|
     |                                                                       F |
     | Black or African American alone : in Correctional Facilities for Adults |
     |---------------------

Which variables to keep? We want race and ethnicity variables.

### RENAME
```rename``` lets us rename variables in **Stata**. We want meaningful names.

```rename varname new_varname```

In [10]:
*Fourth, you need to rename the variables with useful names
rename A geoid
rename B geoid2
rename C state
rename D tot_incar
rename E wht_incar
rename F blk_incar
rename G indig_incar
rename H asian_incar
rename I hawpi_incar
rename J other_incar
rename K multirace_incar
rename L lat_incar

In [11]:
*Row 1 just has the variable names so you can drop it now that you are done cleaning
drop in 1

(1 observation deleted)


In [6]:
list in 1/3


     +-----------------------------------------------------------------------+
  1. |       geoid | geoid2 |         state | tot_in~r | wht_in~r | blk_in~r |
     |   0100000US |        | United States |  2263602 |  1139749 |   897875 |
     |-----------------------------------------------------------------------|
     | indig_~r  | asian_~r  | hawpi_~r  | other_~r  | multir~r  | lat_in~r  |
     |    37854  |    16928  |     5494  |   142908  |    22794  |   419509  |
     |-----------------------------------------------------------+-----------|
     |      M | N |         O |         P |        Q  |       R  |        S  |
     | 885956 |   | 308745538 | 223553265 | 38929319  | 2932248  | 14674252  |
     |-----------------------------------------------------------------------|
     |       T  |         U  |        V  |         W   |          X   |  Y   |
     |  540013  |  19107368  |  9009073  |  50477594   |  196817552   |      |
     |-----------------------+---------------------

Need to only keep certain variables.

In [12]:
drop M-AI
desc




Contains data
  obs:            53                          
 vars:            12                          
 size:        34,715                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
geoid           str63   %63s                  
geoid2          str6    %9s                   
state           str20   %20s                  
tot_incar       str50   %50s                  
wht_incar       str51   %51s                  
blk_incar       str71   %71s                  
indig_incar     str79   %79s                  
asian_incar     str51   %51s                  
hawpi_incar     str88   %88s                  
other_incar     str61   %61s                  
multirace_incar str57   %57s                  
lat_incar       str58   %58s                  
--------------

In [13]:
*This is one way to transform the string variables into numeric variables.
destring geoid2, replace
destring tot_incar, replace
destring wht_incar, replace
destring blk_incar, replace
destring indig_incar, replace
destring asian_incar, replace
destring hawpi_incar, replace
destring other_incar, replace
destring multirace_incar, replace
destring lat_incar, replace
describe
*You will know it worked, if in data view, the variables are displayed in black text color.


geoid2: all characters numeric; replaced as byte
(1 missing value generated)

tot_incar: all characters numeric; replaced as long

wht_incar: all characters numeric; replaced as long

blk_incar: all characters numeric; replaced as long

indig_incar: all characters numeric; replaced as long

asian_incar: all characters numeric; replaced as int

hawpi_incar: all characters numeric; replaced as int

other_incar: all characters numeric; replaced as long

multirace_incar: all characters numeric; replaced as int

lat_incar: all characters numeric; replaced as long


Contains data
  obs:            53                          
 vars:            12                          
 size:         6,042                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
geoid        

<b> Shortcut </b>

You can use the - to list variables that are next to each other in the dataset. For example, instead of writing each incarceration variable. We can use the dash (-) to make the code shorter.

````destring tot_incar-lat_incar, replace````

In [14]:
*Next, this is one way to make the string variable of "state" into a numeric variable
encode state, gen(state_num) label("State (numeric variable)")
*You will know if worked if state is in blue text color.

In [15]:
*Let's make sure it worked
desc


Contains data
  obs:            53                          
 vars:            13                          
 size:         6,254                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
geoid           str63   %63s                  
geoid2          byte    %10.0g                
state           str20   %20s                  
tot_incar       long    %10.0g                
wht_incar       long    %10.0g                
blk_incar       long    %10.0g                
indig_incar     long    %10.0g                
asian_incar     int     %10.0g                
hawpi_incar     int     %10.0g                
other_incar     long    %10.0g                
multirace_incar int     %10.0g                
lat_incar       long    %10.0g                
state_num       

### GENERATE

In Stata, you can make (generate) a new variable using the ```generate``` command.

The dataset reports raw counts of incarcerated individuals. Given the range of populations across states. It is useful to compare percentages. So I want to generate the following variables.

### SUMMARIZE

```summarize varname```

This command gives basic descriptive statistics of a specified variable.

In [16]:
generate whtincar_per=(100*wht_incar)/tot_incar
generate blkincar_per=(100*blk_incar)/tot_incar
generate indigincar_per=(100*indig_incar)/tot_incar
generate latincar_per=(100*lat_incar)/tot_incar
generate asianincar_per=(100*asian_incar)/tot_incar
generate hawpincar_per=(100*hawpi_incar)/tot_incar
generate otherincar_per=(100*other_incar)/tot_incar
generate multincar_per=(100*multirace_incar)/tot_incar

summarize whtincar_per-multincar_per











    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
whtincar_per |         53    55.42296    17.80663   5.669816   89.83419
blkincar_per |         53    32.74913    20.98107   2.716373   87.43746
indigincar~r |         53    4.141851    7.938771   .0929224   37.54161
latincar_per |         53     15.2114     16.0398    1.87551   97.62694
asianincar~r |         53    1.010316    2.884845   .0847523   21.29385
-------------+---------------------------------------------------------
hawpincar_~r |         53    .8842754    5.281817          0   38.55103
otherincar~r |         53     4.30683    4.769431   .4042975   21.18251
multincar_~r |         53    1.484637     1.59199   .3716896   9.272468


### FOREACH loop
In coding you can use loops for repeative tasks. If you want to repeative tasks across variables, you can use the foreach loop. For example, we can shortcut the previous tasks to a single loop below.

```foreach x of varlist varname { ... `x' ... }```

In [52]:
foreach x of varlist wht_incar-lat_incar {
    gen `x'per=(100*`x')/tot_incar
    }
describe

### RECODE
Recode is a variable that will recode a numeric variable based on certain rules.

``` recode v1 (3=0) (4=-1) (5=-2), generate(newv1)```

In [17]:
gen eparegion=geoid2
recode eparegion ( 9 23 25 33 44 50 =1) ( 34 36 72 =2) ///
    ( 10 11 24 42 51 54= 3) ( 1 12 13 21 28 37 45 47=4) ///
    ( 17 18 26 27 39 55=5) ( 5 22 35 40 48=6) ( 19 20 29 31=7) ///
    ( 8 30 38 46 49 56=8) ( 4 6 15 32=9) ( 16 41 53 =10)


(1 missing value generated)

(eparegion: 50 changes made)


### FORVALUES loop
The ```forvalues``` is a loop where you can specify specific values of a variable. 

```forvalues i = range { ... }```

For example we can write summary statistics for each EPA region

In [62]:
forvalues region = 1/10 {
      display `region'
      summarize whtincar_per blkincar_per latincar_per if eparegion == `region'
  }


1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
whtincar_per |          6    69.51284    22.18784   35.43547   89.83419
blkincar_per |          6    19.92341    14.49321   6.632237    40.8146
latincar_per |          6    15.07152    11.56767    1.87551   28.54579
2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
whtincar_per |          3    38.63895    3.935556   35.58852   43.08131
blkincar_per |          3    38.16316    26.85431   7.156443   53.96915
latincar_per |          3    16.37803    8.936492   6.062768   21.77512
3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
whtincar_per |          6    37.51135     20.0496   5.669816   67.63306
blkincar_per |          6    57.77237    20.07656   28

In [18]:
*Let's save our data
cd "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data"
save "incarceration 4 14 25.dta", replace


D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data

file incarceration2021 4 14 25.dta saved


## Getting Census data
The *Census* is a popular secondary data source in sociology. In this workbook, we will use NHGIS to download census data and clean it in **Stata**.

<a id="https://www.nhgis.org/">NHGIS</a> is a data hub for census data over time. They have sorted the data made it easy to download. If you use NHGIS data, you must cite them (the citation information is found on their website). You need to make an account in order to use the site. Once you are logged in, click "Select Data" in the top right hand corner.

For census data, there are two main dataset:
### Census
The Census reports data every ten years. They only report race, ethnicity, and home renter/owner. Note that how the census defines race and ethnicity HAS CHANGED over time. If possible, use the Census data. Census data is more reliable as compared to ACS.

### American Community Survey
The American Community Survey (ACS) is report more frequently than the Census. The ACS also report MORE information than the census. ACS reports income, family structures, educational attainment, etc... The ACS has different waves of data: 5-year wave and 3-year wave. Generally census tracts and census blocks are reported ONLY at the five-year wave. Counties and states are reported at the 3-year wave. Generally, the smaller geographic areas are less reliable. For example census tracts are more reliable than census blocks.

There are four green tabs:
#### Geographic Levels
You can chose a variety of level of analysis such as census tract, county, state, etc..
#### Years
You can pick years or waves
#### Topics
There are many topics you can choose from
#### OR AND
You can choose "or" and "and" options

### Linking Prison Policy Intiative data with Census data
Let's say we want to compare the incarcerated percent with the states. We can easily get this census data from NHGIS. 

We can download census data from the [NHGIS](https://www.nhgis.org/). Make a account and download data at the state level for year 2020 for variable: "Hispanic or Latino Origin by Race"

Once the file is ready, you can download it. You have to unzip the file.

Once the file is unzipped, there is the .csv (data) and .txt (codebook) file.

You must look at the codebook to figure out what variables. The geographic variables will make up unqiue ID.

In [4]:
*read the csv file; make sure to include clear
import delimited "D:\documents copy\research\Practice\extract\nhgis\nhgis0138_csv\nhgis0138_ds172_2010_state.csv", clear
describe


(74 vars, 52 obs)


Contains data
  obs:            52                          
 vars:            74                          
 size:         9,672                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
gisjoin         str4    %9s                   GISJOIN
year            int     %8.0g                 YEAR
stusab          str2    %9s                   STUSAB
regiona         byte    %8.0g                 REGIONA
divisiona       byte    %8.0g                 DIVISIONA
state           str20   %20s                  STATE
statea          byte    %8.0g                 STATEA
countya         byte    %8.0g                 COUNTYA
cousuba         byte    %8.0g                 COUSUBA
cousubcc        byte    %8.0g                 COUSUBCC
placea          byte    

In [5]:
*Make sure to review the codebook to get a breakdown of the variables
rename h7z001 tot
rename h7z002 nt_lat
rename h7z003 wht_ntl
rename h7z004 blk_ntl
rename h7z005 ami_ntl
rename h7z006 asn_ntl
rename h7z007 hawpac_ntl
rename h7z008 other_ntl
rename h7z009 multirace_ntl
rename h7z010 lat
rename h7z011 wht
rename h7z012 blk
rename h7z013 ami
rename h7z014 asn
rename h7z015 hawpac
rename h7z016 other
rename h7z017 multirace

In [6]:
*We could also do a foreach loop
gen latper=100*(lat/tot)
gen whtper=100*((wht_ntl+wht)/tot)
gen blkper=100*((blk_ntl+blk)/tot)
gen amiper=100*((ami_ntl+ami)/tot)
gen asnper=100*((asn_ntl+asn)/tot)
gen hawpacper=100*((hawpac_ntl+hawpac)/tot)
gen otherper=100*((other_ntl+other)/tot)
gen multiraceper=100*((multirace_ntl+multirace)/tot)

In [7]:
list in 1


     +------------------------------------------------------------------------+
  1. | gisjoin  | year  | stusab  | regiona  | divisi~a  |   state  | statea  |
     |    G010  | 2010  |     AL  |       3  |        6  | Alabama  |      1  |
     |---------------------------------------------------+--------------------|
     | countya | cousuba | cousubcc | placea  | placecc  | tracta  | blkgrpa  |
     |       . |       . |        . |      .  |       .  |      .  |       .  |
     |-------------------------------------------------------------+----------|
     | blocka | concita | aianhha | res_on~a | trusta  | aianhhcc  | aitscea  |
     |      . |       . |       . |        . |      .  |        .  |       .  |
     |------------------------------------------------------------------------|
     | aits | ttracta | tblkgrpa | anrca | cbsaa | metdiva  | csaa  | nectaa  |
     |    . |       . |        . |     . |     . |       .  |    .  |      .  |
     |---------------------------------

In [5]:
*I only want to keep relevant variables
keep name geocode latper-multiraceper

#### MERGE
Another important data cleaning tool is merging. Sometimes we want to link two different dataset together. In order to ```merge``` datasets, they need to be linked with key variable. In geographical data, the *Census* gives unique code to Census states, counties, tracts, and blocks. In the current dataset geocode is the FIPS code. In the PPI data, geoid2 is the FIPS code. When using ```merge``` the unique identifier needs to be same variable name.

```merge 1:1 key_variable using file_path_name```

In [6]:
*I am renaming to make the using file
rename geocode geoid2

In [7]:
cd "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data\"
merge 1:1 geoid2 using "incarceration2021 4 8 25.dta"


D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data


    Result                           # of obs.
    -----------------------------------------
    not matched                             1
        from master                         0  (_merge==1)
        from using                          1  (_merge==2)

    matched                                52  (_merge==3)
    -----------------------------------------


This output gives us important information of the merge. It also creates a new variable

In [3]:
describe


Contains data
  obs:             0                          
 vars:             0                          
 size:             0                          
Sorted by: 


In [1]:
list name latper latincar_per 

no variables defined


r(111);





### COLLAPSE
Making datasets based on summary statistics. Let's say we want the averages by EPA region.

```collapse (statistics) varname, by(cat_name)```

In [73]:
collapse (mean) latper-multiraceper whtincar_per-multincar_per,by(eparegion)

## World Bank Data
Another popular secondary data is [World Bank's World Development Indicators](https://databank.worldbank.org/source/world-development-indicators). 

### Country
They have countries and aggregates. You usually want countries. Aggregates are country groups like "European counties" or "High income countries"

### Series
This is the type of variables. 

### Time
This is where to pick the years.

For purpose of this assignment, pick all countries. Variables: GDP per capita (contstant 2015 US$) and carbon dioxide (CO2) emission (total) excluding LULUCF per capita. Years:  2010-2023. Then press "Download options" near the top, right corner. Pick .csv

Again, you have the unzip the files. I have a folder where I unzip all files. Pick the one without metadata.

In [105]:
import delimited "D:\documents copy\teaching\SOCG 206 spring 2025\jupyter\data\240b728c-ef6f-4084-8509-9c06e59888a9_Data.csv", clear
desc


(18 vars, 440 obs)


Contains data
  obs:           440                          
 vars:            18                          
 size:       184,800                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
v1              str48   %48s                  
v2              str12   %12s                  
v3              str74   %74s                  
v4              str20   %20s                  
v5              str19   %19s                  
v6              str19   %19s                  
v7              str19   %19s                  
v8              str19   %19s                  
v9              str19   %19s                  
v10             str19   %19s                  
v11             str19   %19s                  
v12             str19   %19s              

**Stata** read the variables as string meaning, there are some characters. Let's do some visual tests.

In [40]:
list in 1/5
list in 435/440



     +-------------------------------------------------------------------------+
  1. |                      v1            |                      v2            |
     |            Country Name            |            Country Code            |
     |-------------------------------------------------------------------------|
     | v3                                                                      |
     | Series Name                                                             |
     |-------------------------------------------------------------------------|
     |                    v4   |                  v5   |                  v6   |
     |           Series Code   |       2010 [YR2010]   |       2011 [YR2011]   |
     |-------------------------------------------------------------------------|
     |                  v7   |                  v8    |                  v9    |
     |       2012 [YR2012]   |       2013 [YR2013]    |       2014 [YR2014]    |
     |--------------------

In [41]:
list v5 in 1/20


     +---------------------+
     |                  v5 |
     |---------------------|
  1. |       2010 [YR2010] |
  2. |    542.871030476037 |
  3. |   0.275380974794698 |
  4. |    3577.11432498868 |
  5. |    1.56432102617866 |
     |---------------------|
  6. |    4456.61027416201 |
  7. |    3.27771433788594 |
  8. |    12446.3086265015 |
  9. | 0.00181067574418773 |
 10. |    36277.2644412087 |
     |---------------------|
 11. |                  .. |
 12. |    3114.69608592822 |
 13. |   0.976178185498281 |
 14. |    16132.6118723322 |
 15. |    3.26418190342241 |
     |---------------------|
 16. |    13387.1553748458 |
 17. |    4.29667743910718 |
 18. |    2796.08143428043 |
 19. |    1.41184971098266 |
 20. |    27324.5554259804 |
     +---------------------+


We can see there are some notes at the bottom of the file.

We also see that the data uses .. for missing values. <b>Remember missing values is different than zero!</b>

In [106]:
*drops notes
drop in 436/440

(5 observations deleted)


In [107]:
*Rename variables
rename v1 countryname
rename v2 countrycode
rename v3 varname
rename v4 varcode

In [108]:
*renaming variable years
local i=2010
foreach x of varlist v5-v18 {
    rename `x' yr`i'
    local i = `i' +1
    }

In [109]:
*the first row is column names; we need to drop it
drop in 1

(1 observation deleted)


In [110]:
*need to destring the year data
destring yr20*, replace ignore("..")

yr2010: character . removed; replaced as double
(23 missing values generated)
yr2011: character . removed; replaced as double
(22 missing values generated)
yr2012: character . removed; replaced as double
(23 missing values generated)
yr2013: character . removed; replaced as double
(22 missing values generated)
yr2014: character . removed; replaced as double
(22 missing values generated)
yr2015: character . removed; replaced as double
(20 missing values generated)
yr2016: character . removed; replaced as double
(23 missing values generated)
yr2017: character . removed; replaced as double
(23 missing values generated)
yr2018: character . removed; replaced as double
(23 missing values generated)
yr2019: character . removed; replaced as double
(23 missing values generated)
yr2020: character . removed; replaced as double
(23 missing values generated)
yr2021: character . removed; replaced as double
(23 missing values generated)
yr2022: character . removed; replaced as double
(25 missing valu

In [47]:
desc
list in 1/2



Contains data
  obs:           434                          
 vars:            18                          
 size:       115,444                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
countryname     str48   %48s                  
countrycode     str12   %12s                  
varname         str74   %74s                  
varcode         str20   %20s                  
yr2010          double  %10.0g                
yr2011          double  %10.0g                
yr2012          double  %10.0g                
yr2013          double  %10.0g                
yr2014          double  %10.0g                
yr2015          double  %10.0g                
yr2016          double  %10.0g                
yr2017          double  %10.0g                
yr2018         

In [50]:
list countryname in 1/5


     +-------------+
     | countryname |
     |-------------|
  1. | Afghanistan |
  2. | Afghanistan |
  3. |     Albania |
  4. |     Albania |
  5. |     Algeria |
     +-------------+


In [51]:
list varname in 1/5


     +-------------------------------------------------------------------------+
     | varname                                                                 |
     |-------------------------------------------------------------------------|
  1. | GDP per capita (constant 2015 US$)                                      |
  2. | Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/ca.. |
  3. | GDP per capita (constant 2015 US$)                                      |
  4. | Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/ca.. |
  5. | GDP per capita (constant 2015 US$)                                      |
     +-------------------------------------------------------------------------+


When using this data, you need to clean it too. One important tool is learning to ```reshape``` in **Stata**.

### RESHAPE
Sometimes we want to flip data. 

<b>LONG FORM</b>
| i | j | stub |
|:--------:|:--------:|:--------:|
|  1   |  1   |  4.1   |
|  1   |  2   |  4.5   |
|  2   |  1   |  3.3   |
|  2   |  2   |  3.0   |

<b>WIDE FORM</b>
| i | stub1 | stub2 |
|:--------:|:--------:|:--------:|
|  1   |  4.1   |  4.5   |
|  2   |  3.3   |  3.0   |

The World Bank is in a mix of wide and long form. The years are in wide form, so lets first reshape the data where year is a variable

In [111]:
reshape long yr, i(countryname countrycode varname) j(year)

rename yr value


(note: j = 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
> )

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      434   ->    6076
Number of variables                  18   ->       6
j variable (14 values)                    ->   year
xij variables:
               yr2010 yr2011 ... yr2023   ->   yr
-----------------------------------------------------------------------------



In [93]:
desc


Contains data
  obs:         6,076                          
 vars:             6                          
 size:       996,464                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
countryname     str48   %48s                  
countrycode     str12   %12s                  
varname         str74   %74s                  
year            int     %10.0g                
varcode         str20   %20s                  
value           double  %10.0g                
--------------------------------------------------------------------------------
Sorted by: countryname  countrycode  varname  year
     Note: Dataset has changed since last saved.


Data went from wide to long

In [57]:
list countrycode varcode year value in 1/5


     +----------------------------------------------------+
     | count~de                varcode   year       value |
     |----------------------------------------------------|
  1. |      AFG   EN.GHG.CO2.PC.CE.AR5   2010   2.754e+14 |
  2. |      AFG   EN.GHG.CO2.PC.CE.AR5   2011   3.887e+14 |
  3. |      AFG   EN.GHG.CO2.PC.CE.AR5   2012   3.196e+14 |
  4. |      AFG   EN.GHG.CO2.PC.CE.AR5   2013   2.625e+14 |
  5. |      AFG   EN.GHG.CO2.PC.CE.AR5   2014   2.386e+14 |
     +----------------------------------------------------+


Now the rest of varname/varcode are in long form and need to be converted to wide.

In [59]:
tab varname
tab varcode



                                varname |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
Carbon dioxide (CO2) emissions exclud.. |      3,038       50.00       50.00
     GDP per capita (constant 2015 US$) |      3,038       50.00      100.00
----------------------------------------+-----------------------------------
                                  Total |      6,076      100.00


             varcode |      Freq.     Percent        Cum.
---------------------+-----------------------------------
EN.GHG.CO2.PC.CE.AR5 |      3,038       50.00       50.00
      NY.GDP.PCAP.KD |      3,038       50.00      100.00
---------------------+-----------------------------------
               Total |      6,076      100.00


In [112]:
gen varhold=.
replace varhold=1 if varcode=="EN.GHG.CO2.PC.CE.AR5"
replace varhold=2 if varcode=="NY.GDP.PCAP.KD"


(6,076 missing values generated)

(3,038 real changes made)

(3,038 real changes made)


In [113]:
*reshape is sensitive to unnecessary variable
drop varname varcode

In [114]:
reshape wide value, i(countryname countrycode year) j(varhold)

(note: j = 1 2)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                     6076   ->    3038
Number of variables                   5   ->       5
j variable (2 values)           varhold   ->   (dropped)
xij variables:
                                  value   ->   value1 value2
-----------------------------------------------------------------------------


In [116]:
list in 1/5


     +-------------------------------------------------------+
     | countryname   count~de   year      value1      value2 |
     |-------------------------------------------------------|
  1. | Afghanistan        AFG   2010   2.754e+14   5.429e+14 |
  2. | Afghanistan        AFG   2011   3.887e+14   5.254e+13 |
  3. | Afghanistan        AFG   2012   3.196e+14   5.689e+14 |
  4. | Afghanistan        AFG   2013   2.625e+14   5.806e+14 |
  5. | Afghanistan        AFG   2014   2.386e+14   5.751e+14 |
     +-------------------------------------------------------+


In [117]:
rename value1 co2pc
rename value2 gdppc
list in 1/5





     +-------------------------------------------------------+
     | countryname   count~de   year       co2pc       gdppc |
     |-------------------------------------------------------|
  1. | Afghanistan        AFG   2010   2.754e+14   5.429e+14 |
  2. | Afghanistan        AFG   2011   3.887e+14   5.254e+13 |
  3. | Afghanistan        AFG   2012   3.196e+14   5.689e+14 |
  4. | Afghanistan        AFG   2013   2.625e+14   5.806e+14 |
  5. | Afghanistan        AFG   2014   2.386e+14   5.751e+14 |
     +-------------------------------------------------------+
