# Strings & Files

## Let's look at data:
* Twelve-month prevalence and population estimates of DSM-IV alcohol abuse by age, sex, and race-ethnicity: United States, 2001–2002 (NESARC) [txt format](https://pubs.niaaa.nih.gov/publications/aeds/aodprevalence/abusdep1.txt)

* Twelve-month prevalence and population estimates of DSM-IV alcohol dependence by age, sex, and race-ethnicity: United States, 2001–2002 (NESARC) [txt format](https://pubs.niaaa.nih.gov/publications/aeds/aodprevalence/abusdep2.txt)

Source: [National Institute on Alcohol Abuse and Alcoholism](https://pubs.niaaa.nih.gov/publications/aeds/aodprevalence/aodprevalence.htm)

In [None]:
! head ../data/abusdep1.txt

# How do we turn that into something computer readable?

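`with` opens the file through a *context manager*, which makes sure the file is closed when the block ends, even if an error occurs inside it. 
`'r'` is the file mode. The most commonly used modes are: 
* r - read
* w - write (an existing file is truncated first)
* a - append
* \+ - added to another mode (for example `r+`) to open the file for both reading and writing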
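
To see the modes in action, here is a minimal sketch that writes, appends to, and re-reads a scratch file. The file name `modes_demo.txt` and the temporary directory are made up for this illustration and are not part of the course data.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch the course data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "modes_demo.txt")

# 'w' creates the file, or truncates it if it already exists.
with open(demo_path, "w") as f:
    f.write("first line\n")

# 'a' keeps the existing content and writes at the end.
with open(demo_path, "a") as f:
    f.write("second line\n")

# 'r+' opens an existing file for both reading and writing.
with open(demo_path, "r+") as f:
    lines = f.readlines()

print(lines)
print(f.closed)  # the with block closed the file when it ended
```

Note that `f` still exists after the `with` block, but it can no longer be read from or written to.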


In [None]:
filepath = "../data/abusdep1.txt"
with open(filepath, 'r') as f:
    content = f.read()

# How do we find the encoding? Right click -> View Page Info
![screenshot of page info window with text encoding=windows-1252 highlighted](figs/L05/textencoding.png)
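
Why does the encoding matter? The sketch below assumes the file contains an en dash like the one in the table title "2001–2002"; the sample string is an illustration, not text copied from the file. In windows-1252 the en dash is stored as a single byte that is not valid UTF-8, so decoding with the wrong codec fails.

```python
# Encode a sample string the way a windows-1252 file would store it.
raw = "2001–2002".encode("windows-1252")
print(raw)

# Decoding those bytes as UTF-8 raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("utf-8 failed:", err)

# Decoding with the matching codec recovers the original text.
text = raw.decode("windows-1252")
print(text)
```

Passing `encoding="windows-1252"` to `open` applies the same decoding while the file is being read.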

In [None]:
filepath = "../data/abusdep1.txt"
with open(filepath, 'r', encoding="windows-1252") as f:
    content = f.read()

In [None]:
content

# What is that?

In [None]:
type(content)

# String library:
* https://docs.python.org/3/library/stdtypes.html#string-methods

```python
 str.split(sep=None, maxsplit=-1)
```
```
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made).
```
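
A short sketch of how `sep` and `maxsplit` change the result. The sample strings are made up for illustration.

```python
sample = "alpha  beta   gamma"  # made-up text with uneven spacing

# No separator: split on runs of whitespace and drop empty strings.
words = sample.split()

# Explicit single-space separator: every space is a split point,
# so consecutive spaces produce empty strings.
pieces = sample.split(" ")

# maxsplit=1: split once, keep the rest of the string intact.
head_tail = sample.split(maxsplit=1)

# Splitting on a newline turns a block of text into lines.
two_lines = "line one\nline two".split("\n")

print(words)
print(pieces)
print(head_tail)
print(two_lines)
```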

In [None]:
content.split("\n")

# Can alternatively do this directly on the read:

In [None]:
filepath = "../data/abusdep1.txt"
with open(filepath, 'r', encoding="windows-1252") as f:
    content = f.readlines()

In [None]:
content

# What line does the total data start at?

The `enumerate` function yields each item in the list together with its index.

In [None]:

for i, line in enumerate(content):
    print(i, line)

In [None]:
for i, line in enumerate(content):
    if 'Total' in line:
        print(i, line)

In [None]:
# Let's grab the block that starts with the Total row at index 12; we know each block is 5 lines

total = content[12:12+5]

In [None]:
total

In [None]:
# how do we break up a line:
total[0]

In [None]:
# the default separator for split is whitespace
total[0].split()

In [None]:
# how do we store this? Let's go back up and see what the columns mean
print(content[5])
print(content[8])

# How do we combine these two header rows into column names?

In [None]:
sex = content[5].split()
sex

In [None]:
msmt = content[8].split()
msmt

Exercise:
Manipulate the sex and msmt lists to get the following column headings:
```
['characteristic', 'Male-%', 'Male-S.E.', 'Male-estimate', 'Female-%', 'Female-S.E.', 'Female-estimate',
 'Total-%', 'Total-S.E.', 'Total-estimate']
```

In [None]:
columns = [msmt[0]]
for s in sex:
    for ms in msmt[1:4]:
        columns.append(f'{s}-{ms}')

In [None]:
# Now let's store the total records using our columns
measurements = total[0].split()

In [None]:
# How to combine the two? 
rec = zip(columns, measurements)

In [None]:
rec

In [None]:
# how to store it: as a dictionary
record = dict(rec)
record

In [None]:
list(rec)  # Why is it empty? zip returns an iterator, which can be consumed only once, and dict(rec) already used it up
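
The one-pass behavior of `zip` can be reproduced with a small sketch. The `keys` and `values` lists are made up for illustration.

```python
keys = ["a", "b"]
values = [1, 2]

pairs = zip(keys, values)

# The first pass walks through every pair.
first_pass = list(pairs)

# The iterator is now exhausted, so a second pass finds nothing.
second_pass = list(pairs)

# To reuse the pairs, materialize them into a list once.
pairs_list = list(zip(keys, values))
as_dict = dict(pairs_list)
still_there = list(pairs_list)

print(first_pass)
print(second_pass)
print(as_dict, still_there)
```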

In [None]:
records = [record]
records

In [None]:
# let's store everything else in total in records:
total

In [None]:
records = []
for row in total:
    records.append(dict(zip(columns, row.split())))

In [None]:
records

## Exercise
Parse the other demographics & store the results as record dicts in the records list. 
Hint: Instead of copy and pasting, what should be done with the operations that are consistent across records?
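
One possible way to act on the hint is to wrap the repeated split-and-zip step in a function. This is only a sketch: the name `parse_block` and the demo rows are hypothetical, and the numbers are not taken from the data file.

```python
def parse_block(lines, columns):
    """Turn whitespace-separated text lines into a list of record dicts."""
    return [dict(zip(columns, line.split())) for line in lines]


# Made-up columns and rows, only to show the shape of the result.
demo_columns = ["characteristic", "Male-%", "Male-S.E."]
demo_lines = ["Total 1.0 0.1\n", "18-29 2.0 0.2\n"]

demo_records = parse_block(demo_lines, demo_columns)
print(demo_records)
```

Because the rows are split on whitespace, a row whose label itself contains spaces would need extra handling before it lines up with the columns.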

# Let's now work with our records

In [None]:
records

In [None]:
# let's do a validity check: sum the male estimates of the subgroups and compare the sum with the Total row
msm = 0
for row in records:
    if row['characteristic'] != 'Total':
        msm += int(row['Male-estimate'])

In [None]:
msm, records[0]['Male-estimate']

## Exercise
### Try with a different grouping

# Partner up & come up with 3 questions you want to ask of the data

# Let's save the cleaned-up data as a spreadsheet


In [None]:
import csv

with open('abuse1.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=columns)
    writer.writeheader()
    for row in records:
        writer.writerow(row)
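
`csv.DictReader` is the counterpart of `csv.DictWriter`. The sketch below does a round trip through an in-memory buffer instead of a file on disk; the demo columns and values are made up for illustration.

```python
import csv
import io

# Made-up records, only to show the round trip.
demo_columns = ["characteristic", "Male-estimate"]
demo_records = [
    {"characteristic": "Total", "Male-estimate": "100"},
    {"characteristic": "18-29", "Male-estimate": "40"},
]

# An in-memory text buffer stands in for the CSV file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=demo_columns)
writer.writeheader()
writer.writerows(demo_records)  # writes all rows in one call

# Rewind and read the rows back as dictionaries keyed by the header.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)
```

Every value comes back as a string, so numeric columns still need converting after they are read.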

In [None]:
!head abuse1.csv

In [None]:
# let's add one more column for race-ethnicity; it will be useful for more advanced spreadsheet operations
# in doing so, we're going to introduce a new library for working with spreadsheets: pandas

In [None]:
# df is a DataFrame object: a table of labeled rows and columns that we can filter, transform, and summarize
import pandas as pd
df = pd.read_csv("abuse1.csv")

In [None]:
df.head()

In [None]:
# dataframe columns are accessed like dictionary entries
df['characteristic']

In [None]:
# And each column is a Series, which can be indexed much like a list
charac = df['characteristic'][0]
charac

In [None]:
# create a new column in the same manner we created a new dictionary element
df['race-ethnicity'] = charac

In [None]:
df

# How do we get the correct race-ethnicity for the other demographics?

In [None]:
# let's save the newest spreadsheet
df.to_csv("abuse1s.csv", index=False)

In [None]:
!head abuse1s.csv