### Extract Variable Abbreviations and Descriptions

The goal of this notebook is to extract variable abbreviations from the study CSV files. Variable descriptions are included where they are available in a codebook, in a column of the data, or, for a Stata file, in the file's metadata.
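For the Stata case, the labels stored in a file's metadata can be read with pandas. The sketch below is a minimal illustration and not part of this notebook's pipeline: the toy data, the labels and the file name `toy.dta` are all hypothetical placeholders. It writes a small Stata file with variable labels and reads them back through `variable_labels()`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data; the column names and labels are placeholders.
toy = pd.DataFrame({"age": [73, 41], "wt": [71.3, 69.6]})
labels_in = {"age": "Age of participant", "wt": "Body weight"}

with tempfile.TemporaryDirectory() as tmp:
    dta_path = Path(tmp) / "toy.dta"
    # Write a small Stata file that carries variable labels in its metadata.
    toy.to_stata(dta_path, write_index=False, variable_labels=labels_in)

    # Open the file as a reader and pull the labels from the header.
    with pd.read_stata(dta_path, iterator=True) as reader:
        stata_labels = reader.variable_labels()

print(stata_labels)
```

The returned dictionary maps each variable name to its label, so it could feed the description column directly wherever a study ships a `.dta` file.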


In [1]:
import pandas as pd
import fsspec
from pathlib import Path

In [2]:
# "file" is the local filesystem, which is also what an empty protocol string falls back to
fs = fsspec.filesystem("file")

### Variable Abbreviation

The variable names should come from the CSV files, since these are the source from which our data will be extracted.

In [3]:
# List the file names in ../csv/, skipping hidden (dot-prefixed) entries
avail_csv = [Path(f).name for f in fs.ls('../csv/') if not Path(f).name.startswith('.')]
avail_csv

['Europe_CH_SIB.csv']

Just a single file:

In [5]:
# Drop the first column, presumably a row index that was saved with the CSV
df = pd.read_csv('../csv/Europe_CH_SIB.csv').iloc[:, 1:]

In [6]:
df

Unnamed: 0,pt,phyact,alcfrq,sbsmk,ethori_self,jobtyp,lvpl,SBP,DBP,mrtsts2,gender,age,wt,ht,cafuse,HRTRTE,dginvtx2,dginvtx3,cmatccd1_2,cmatccd1_3
0,FAKE0,>3WK,N,N,O,SE,SW,132.5,68.0,0,0,73,71.3,146.0,NONE,80.5,CEREBRAL PROBLEMS,HEROIN ADDICT,S01B,S01CA
1,FAKE1,2WK,SW,S,B,IW,PT,188.0,59.0,1,0,41,69.6,149.0,1B3,83.0,LUMBAR PAINS,ESOPHAGEAL AND STOMACH DISORDER,D05A,C08DB
2,FAKE2,K,R,N,O,PR,SW,178.0,114.0,1,0,52,105.8,143.0,1B3,74.0,ALCOHOL PROBLEMS,"ANXIETY, DEPRESSION",A03FA,J05AE
3,FAKE3,K,SW,F,O,ME,SZ,204.5,114.0,1,1,50,48.7,181.5,>6,66.5,EXPECTORANT,ACID RELFUX DISEASE,B05BA,C05CX
4,FAKE4,K,3D,F,B,PR,SW,110.0,87.5,0,1,43,82.4,182.5,>6,77.0,ANTIBIOTIC (GERM INFECTION),PROSTATE PROBLEM,S01AX,A12CC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6728,FAKE6728,1WK,2WK,N,B,EU,PT,151.0,102.0,1,1,51,89.7,186.0,4B6,90.5,UNKNOWN STOMACH PROBLEM,TONIC,N05AN,C08DB
6729,FAKE6729,N,1D,F,K,QW,PT,155.0,72.0,0,0,40,63.4,182.0,>6,80.0,ALLERGIC TO DUST MITE (STOPPED THE MEDICATION ...,RENAL LITHIASIS,A10BG,S01CA
6730,FAKE6730,2WK,N,S,W,FM,SZ,112.5,55.0,1,1,61,61.3,187.0,4B6,91.0,HANDS PAINS,VESTIBULITIS,N05AX,N02CX
6731,FAKE6731,>3WK,R,S,A,FM,SW,102.0,53.5,1,1,59,82.5,174.0,4B6,79.5,PAIN RELEVER,PREVENTION - WELLNESS,S01XA,N07CA


In [7]:
variables = list(df.columns)

In [8]:
variables

['pt',
 'phyact',
 'alcfrq',
 'sbsmk',
 'ethori_self',
 'jobtyp',
 'lvpl',
 'SBP',
 'DBP',
 'mrtsts2',
 'gender',
 'age',
 'wt',
 'ht',
 'cafuse',
 'HRTRTE',
 'dginvtx2',
 'dginvtx3',
 'cmatccd1_2',
 'cmatccd1_3']
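If more CSV files are added later, the same list could be gathered per file without loading any data. The following is a minimal sketch under that assumption: the helper `list_variables` and the toy files are hypothetical, and the filter on `Unnamed:` assumes the unnamed first column is a saved row index.

```python
import tempfile
from pathlib import Path

import pandas as pd


def list_variables(csv_dir):
    """Map each CSV file name in csv_dir to its list of column names."""
    result = {}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        if path.name.startswith("."):
            continue  # skip hidden files
        # nrows=0 parses only the header row, so no data rows are loaded.
        columns = pd.read_csv(path, nrows=0).columns
        # Drop pandas' auto-named column for a saved row index, if present.
        result[path.name] = [c for c in columns if not c.startswith("Unnamed:")]
    return result


# Toy files standing in for a directory of study CSVs.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"pt": ["FAKE0"], "age": [73]}).to_csv(Path(tmp) / "toy_a.csv")
    pd.DataFrame({"pt": ["FAKE1"], "SBP": [188.0]}).to_csv(Path(tmp) / "toy_b.csv")
    found = list_variables(tmp)

print(found)
```

Filtering on the column name rather than slicing by position keeps the helper safe for files that were written without an index column.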

### Variable Descriptions

This part is a little more challenging and will require some digging.

"""
 The variables encode the following information: age (numeric), gender (categorical numeric, woman 0, man 1), birthplace (categorical string), residence (categorical string), job type (categorical string), family and household structure (categorical numeric, alone 0, couple 1); tobacco (categorical string), alcohol use (categorical string), and physical activity (categorical string); weight (numeric), height (numeric), blood pressure (numeric) and heart rate (numeric); and diagnoses (free text, string) and prescriptions (ATC codes, string). 
 """

In [9]:
# Placeholder: no descriptions have been collected yet
descriptions = None
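Once descriptions are found, one way to fill the placeholder is a dictionary keyed by abbreviation, turned into a list aligned with `variables`. This is only a sketch: the pairings below are guesses based on the quoted passage and are not confirmed by a codebook, and the names `known` and `table` are introduced here for illustration.

```python
import pandas as pd

# A few names taken from the variable list above.
variables = ["pt", "age", "gender", "wt", "ht"]

# ASSUMPTION: these abbreviation-to-description pairings are guesses based on
# the quoted passage; they are not confirmed by a codebook.
known = {
    "age": "age (numeric)",
    "gender": "gender (categorical numeric, woman 0, man 1)",
    "wt": "weight (numeric)",
    "ht": "height (numeric)",
}

# dict.get returns None for abbreviations that have no description yet.
descriptions = [known.get(v) for v in variables]

table = pd.DataFrame({"var": variables, "description": descriptions})
print(table)
```

Variables without an entry stay empty in the table, which makes the remaining gaps easy to spot.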

### Construct a pandas dataframe

In [10]:
# Note: this reuses the name df, replacing the data loaded from the CSV above
df = pd.DataFrame({'var': variables, 'description': descriptions})

In [11]:
df

Unnamed: 0,var,description
0,pt,
1,phyact,
2,alcfrq,
3,sbsmk,
4,ethori_self,
5,jobtyp,
6,lvpl,
7,SBP,
8,DBP,
9,mrtsts2,


### Write to metadata folder

In [12]:
df.to_csv('../metadata/variables.csv')
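Note that `to_csv` writes the row index by default, which shows up as an unnamed first column when the file is read back. That may be why the input CSV above had to be sliced with `.iloc[:, 1:]`. The sketch below compares both behaviours on an in-memory buffer; the toy table is a placeholder, and whether the index column is wanted in `variables.csv` is a decision left to the notebook's author.

```python
import io

import pandas as pd

table = pd.DataFrame({"var": ["pt", "age"], "description": [None, None]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
table.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the real columns.
without_index = io.StringIO()
table.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'var', 'description']
print(cols_without_index)  # ['var', 'description']
```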