# Interoperability, or making things speak to each other
## Or, turning everything into numbers

In this studio, we're going to play around with some functions of interoperability - or, explore different data structures and how to make them connect up. 

I find this kind of data science really frustrating (even though I work in data visualization), because it is really unintuitive. It requires a *weird* kind of imagining, so if you feel like you're losing your footing, don't worry, you're not alone. It's vertiginous!

But, it means that we get to roll together some of the ideas we learnt about in our Tables studio, and our Algorithms studio, and prepare for our Images studio in a few weeks' time. 

# Turning the social into numbers

Let's start by going back and getting some data from our last python studio.

## Data from your Reddit API

Sign in to reddit using Google Chrome in a separate tab.

Then go to this page: https://www.reddit.com/prefs/apps

You should already have an app. If you don't, click **create app**

![create app](https://miro.medium.com/max/2400/1*I06ZUKgMjooh2hFopGrqjQ.png)

In the form that will open, you should enter your name, description and uri. For the redirect uri you should choose http://localhost:8080

![redirect uri](https://miro.medium.com/max/2400/1*SrohBPmEox1R9Qdp0K8Z6w.jpeg)

Now, let's import our packages and set up our API connection. You need to fill out your own ID details!

In [None]:
import praw
import pandas as pd

reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",      # your client id
                     client_secret="YOUR_CLIENT_SECRET",  # your client secret
                     user_agent="android:com.example.myredditapp:v1.2.3 (by u/YOUR_USERNAME)", # user agent name
                     username="YOUR_USERNAME",     # your reddit username
                     password="YOUR_PASSWORD")     # your reddit password

print(reddit)

Now, let's scrape our subreddit. In the `sub` section you can choose your subreddit, and then use `query` to run a search term.  

At the end we'll convert it into a pandas data frame called "post_data" which we will use for later gymnastics, and save it to CSV for good measure. 

In [None]:
sub = ['berkeley'] # your subreddit
query = ['I']      # your search term(s)

for s in sub:

    subreddit = reddit.subreddit(s)

    # run one search per search term
    for item in query:
        posts = {
            "title" : [],   # title of the post
            "score" : [],   # score of the post
            "id" : [],      # unique id of the post
            "url" : [],     # url of the post
            "comms_num": [],   # the number of comments on the post
            "created" : [],  # timestamp of the post
            "upvote_ratio" : [],   # the share of votes on the post that are upvotes
            "body" : [] # the body of the post
        }
        for submission in subreddit.search(item, sort="top", limit=1000): # max 1k
            posts["title"].append(submission.title)
            posts["score"].append(submission.score)
            posts["id"].append(submission.id)
            posts["url"].append(submission.url)
            posts["comms_num"].append(submission.num_comments)
            posts["created"].append(submission.created_utc)
            posts["upvote_ratio"].append(submission.upvote_ratio)
            posts["body"].append(submission.selftext)

        # convert the dictionary of lists into a data frame and save it
        post_data = pd.DataFrame(posts)
        post_data.to_csv(s+"_"+ item +"subreddit.csv")

print(subreddit)

For more info on the parameters you can request for a submission, see: http://lira.no-ip.org:8080/doc/praw-doc/html/code_overview/models/submission.html

## Finding numbers in data

In this next section, we're going to get used to different computational types and how they work together.

Let's see what our post_data from Reddit looks like:

In [None]:
post_data.head()

Different data types have different properties which allow them to do things, or not do things. For instance, you can't plot a character on a graph.

In Python, these are the main data types (thanks to Shawn Ren for the graph):
![](https://miro.medium.com/max/700/1*QfI8H_8HplGa1v9IrrWjBA.png) 

So, let's check out the data types of our `post_data` data set:

In [None]:
print(post_data.dtypes)

We're seeing a lot of `object` columns (this is how a pandas data frame labels text and mixed data, which we will need to convert before we can use it), but also some integers and floating points, which are numeric forms. This is awesome! 
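
To see why the type matters, here is a minimal sketch using a made-up three-row table (the `toy` names and values are hypothetical, not from Reddit). Numbers stored as text are not treated as numbers, so they need converting before you can add them up or plot them.

```python
import pandas as pd

# a tiny, made-up table: 'score' holds real numbers,
# 'score_as_text' holds the same numbers stored as strings
toy = pd.DataFrame({
    "title": ["first post", "second post", "third post"],
    "score": [10, 25, 3],
    "score_as_text": ["10", "25", "3"],
})
print(toy.dtypes)

# convert the text column into a numeric column so it can be summed or plotted
toy["score_from_text"] = pd.to_numeric(toy["score_as_text"])
print(toy["score_from_text"].sum())
```

Compare the type reported for `score` with the one reported for `score_as_text` to see the difference.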

So, let's try plotting some data using matplotlib's pyplot. Most digital images are Cartesian (like maps!), meaning that they work on x and y axes, where each pixel is assigned an x,y coordinate. This Cartesian coordinate system, the basis of analytic geometry, combines spatial measurement forms with numeric forms.

![](https://images.deepai.org/glossary-terms/7d0273fdc6cc42aca2fbdd72b61a4499/cartesian.png)

So, you can set any of the int64 or float64 values against each other:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# scatter the number of comments against the upvote ratio
ax.scatter(post_data['upvote_ratio'], post_data['comms_num']) #format (dataframe(column), dataframe(column))
# set a title and labels
ax.set_title('Number of Comments vs Upvote Ratio')
ax.set_xlabel('Upvote ratio (share of votes that are upvotes)')
ax.set_ylabel('Number of comments')

Okay, cool. But what about the time of the post? Take a look at the "created" column - this is a time stamp in Unix time, which is a universal time that is free from timezones:

> Unix time (a.k.a. POSIX time or Epoch time) is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds. It is used widely in Unix-like and many other operating systems and file formats. Due to its handling of leap seconds, it is neither a linear representation of time nor a true representation of UTC. 

We're going to need to bring that into something readable to humans! So, let's convert it, and make sure that it's in a `datetime` format and that it looks about right to the human eye:

In [None]:
post_data["created"]= pd.to_datetime(post_data["created"], yearfirst=True, unit="s")
print(post_data.dtypes)
post_data.head()
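
If Unix time still feels abstract, here is a minimal sketch using only Python's built-in `datetime` module. The timestamp value is an arbitrary example chosen for illustration, not one taken from the Reddit data.

```python
from datetime import datetime, timezone

# an example Unix timestamp (seconds since 1 January 1970, UTC); the value is arbitrary
example_stamp = 1600000000

# convert it into a timezone-aware datetime in UTC
readable = datetime.fromtimestamp(example_stamp, tz=timezone.utc)
print(readable.isoformat())

# and back again, to show that it is the same instant
print(int(readable.timestamp()))
```

The same number always describes the same instant, wherever you are; only the human-readable version changes with the time zone.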

Now, let's plot the date against the number of comments.

In [None]:
fig, ax = plt.subplots()

# scatter the number of comments against the date of the post
ax.scatter(post_data['created'], post_data['comms_num'])
# set a title and labels
ax.set_xlabel('Date')
ax.set_ylabel('Number of Comments on Posts')

So, we've been using the useful and classic `matplotlib` to do our graphics. But it's not really the best. Let's try another and see if we can get some more information. Let's use `plotly`.

In [None]:
!pip install plotly==4.14.3

In [None]:
import plotly.express as px

fig = px.scatter(post_data, x="created", y="upvote_ratio", size="comms_num", color="score", hover_name="title", size_max=60)
fig.show()

I'm not going to bore you with more graphs - but when you're feeling up to it, feel free to take a look at the different kinds of charts you can make and have a play around - you could even combine several reddit datasets!

https://plotly.com/python/

# Turning our bodies into numbers

Now, let's turn to something a little more complicated. Reflecting on Wernimont's piece on the Quantified Self, let's explore some of the ways in which our bodies are made into data. 

I've located and exported my own health data from my iPhone's Health App for a laugh (it's seriously incomplete, and I didn't even realise I had authorised it). 

When downloaded, this comes in a .zip format. When expanded, you get two files - `export.xml` is the one that we want. 

XML, like geojson, is a good format for holding together different types of data in the same document. But it's not super useful for python, so we're going to run the `apple-health-data-parser` created by Nicholas Radcliffe to "parse" or separate out the data into different CSV files. Then we can have a little look at it more closely.

Normally, we would run a .py file using the command line (like terminal), but Jupyter is friendly, and actually lets us run .py files like a command line from inside the notebook! So, making sure that the following are in the same folder (which they will be if you have downloaded this from github) - `Interoperability_Studio.ipynb`, `apple-health-data-parser.py` and `export.xml`, let's try to do some parsing!

In [None]:
# run the parser script on the exported XML file
%run -i 'apple-health-data-parser' 'export.xml'
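
To get a feel for what "parsing" means here, this is a minimal sketch using Python's built-in `xml.etree.ElementTree`. It is not the parser's actual code, and the XML snippet is made up and heavily simplified (the tag and attribute names are assumptions for illustration). The idea is simply to sort records into separate piles by type.

```python
import xml.etree.ElementTree as ET

# a tiny, made-up snippet in the general shape of a health export (not real data)
toy_xml = """
<HealthData>
  <Record type="StepCount" startDate="2021-03-01 09:00:00" value="120"/>
  <Record type="StepCount" startDate="2021-03-01 10:00:00" value="80"/>
  <Record type="FlightsClimbed" startDate="2021-03-01 10:05:00" value="2"/>
</HealthData>
"""

toy_root = ET.fromstring(toy_xml)

# sort the records into one list of rows per record type
rows_by_type = {}
for record in toy_root.iter("Record"):
    row = (record.get("startDate"), float(record.get("value")))
    rows_by_type.setdefault(record.get("type"), []).append(row)

for record_type, rows in rows_by_type.items():
    print(record_type, rows)
```

Each pile could then be written out as its own CSV file, which is roughly the shape of the files we read in below.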

Awesome! Looks like Apple has been secretly collecting four kinds of my data: flights of stairs climbed, how often and loudly I use my headphones, my step count and how far I walk. Let's explore some of this data.

We start by importing six libraries (install them first if you haven't already): `numpy` (or number python, num-py), `pandas` (our much loved data format), `glob`, which helps us find data paths on our computers, the `pytz` time zone calculator, `pyplot` for making graphs, and `datetime`, which does as it says. 

In [None]:
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import date, datetime, timedelta as td
import pytz


Okay, let's see what this data is all about. 

In [None]:
steps = pd.read_csv("StepCount.csv") #use pandas (pd) to read the csv file
steps.head() #have a look at the first five rows of data

And check out what kind of data we're working with here: 

In [None]:
print(steps.dtypes)

Lots of objects, again, and some messy time formats too. Let's clean up. We need to start with date-time - the data crosses a few timezones, I think, but I want to bring it into the one I'm in now - America/Los_Angeles.

In [None]:
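# functions to convert timestamps to the LA time zone and extract date/time elements
# (timestamps without a time zone are assumed to be UTC; ones that already carry an offset are converted)
def convert_tz(x):
    ts = pd.Timestamp(x)
    if ts.tzinfo is None:
        ts = ts.tz_localize(pytz.utc)
    return ts.tz_convert(pytz.timezone('America/Los_Angeles'))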
get_year = lambda x: convert_tz(x).year
get_month = lambda x: '{}-{:02}'.format(convert_tz(x).year, convert_tz(x).month) #inefficient
get_date = lambda x: '{}-{:02}-{:02}'.format(convert_tz(x).year, convert_tz(x).month, convert_tz(x).day) #inefficient
get_day = lambda x: convert_tz(x).day
get_hour = lambda x: convert_tz(x).hour
get_minute = lambda x: convert_tz(x).minute
get_day_of_week = lambda x: convert_tz(x).weekday()
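
As an aside, pandas can also do this conversion on a whole column at once. Here is a minimal sketch with two made-up timestamps, which we assume were recorded in UTC (the `toy_` names are just for illustration).

```python
import pandas as pd

# made-up timestamps, assumed to be recorded in UTC
toy_utc = pd.Series(pd.to_datetime(["2021-03-01 20:00:00", "2021-07-01 20:00:00"]))

# label them as UTC, then convert the whole column to LA time
toy_la = toy_utc.dt.tz_localize("UTC").dt.tz_convert("America/Los_Angeles")
print(toy_la)

# the same UTC hour lands on different local hours in winter and summer
print(toy_la.dt.hour.tolist())
```

The two local hours differ because of daylight saving time, which is exactly the kind of thing a time zone library handles for you.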

Now, let's "parse" (or separate) out the different time sections:

In [None]:
# parse out date and time elements as LA time
steps['startDate'] = pd.to_datetime(steps['startDate'])
steps['year'] = steps['startDate'].map(get_year)
steps['month'] = steps['startDate'].map(get_month)
steps['date'] = steps['startDate'].map(get_date)
steps['day'] = steps['startDate'].map(get_day)
steps['hour'] = steps['startDate'].map(get_hour)
steps['dow'] = steps['startDate'].map(get_day_of_week)


And check it's lookin' good!

In [None]:
steps.head()

Coolios - as you can see above, EVERYTHING IS NUMBERS. SEPARATE CATEGORISED NUMBERS. What are those categories, you ask?

In [None]:
steps.columns

We can group the data by date, to see how many steps I took each day.

In [None]:
steps_by_date = steps.groupby(['date'])['value'].sum().reset_index(name='Steps')
steps_by_date.head()
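
If `groupby` feels like a black box, here is a minimal sketch on a made-up three-row table (the `toy_` names and values are hypothetical), using the same pattern as the cell above.

```python
import pandas as pd

# three made-up readings: two on the same day, one on the next
toy_steps = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "value": [120, 80, 300],
})

# put rows with the same date into one group, add up each group's values,
# then turn the result back into an ordinary table with a 'Steps' column
toy_by_date = toy_steps.groupby(["date"])["value"].sum().reset_index(name="Steps")
print(toy_by_date)
```

The two readings from the first day collapse into a single row holding their total.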

Now, let's save it to CSV for good measure, and so we can start visualising!

In [None]:
steps_by_date.to_csv("steps_per_day.csv", index=False)

Time to turn numbers back into images.

In [None]:
steps_by_date['RollingMeanSteps'] = steps_by_date.Steps.rolling(window=10, center=True).mean()
steps_by_date.plot(x='date', y='RollingMeanSteps', title= 'Daily step counts rolling mean over 10 days', figsize=[10, 6])
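
A rolling mean smooths a jumpy series by averaging each value with its neighbours. Here is a minimal sketch on five made-up numbers, with a window of 3 rather than 10 so that you can check it by hand.

```python
import pandas as pd

toy_series = pd.Series([1, 2, 3, 4, 5])

# each value becomes the mean of itself and one neighbour on either side
toy_smoothed = toy_series.rolling(window=3, center=True).mean()
print(toy_smoothed)
```

The first and last values come out as NaN because they don't have a neighbour on both sides. This is also why the line in the plot above starts and ends a few days short.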

What about the day of the week? Let's regroup our data and see what we find.

In [None]:
#regroup
steps_by_date['date'] = pd.to_datetime(steps_by_date['date'])
steps_by_date['dow'] = steps_by_date['date'].dt.weekday

In [None]:
#plot

data = steps_by_date.groupby(['dow'])['Steps'].mean()

fig, ax = plt.subplots(figsize=[10, 6])
ax = data.plot(kind='bar', x='day_of_week')

n_groups = len(data)
index = np.arange(n_groups)
opacity = 0.75

#fig, ax = plt.subplots(figsize=[10, 6])
ax.yaxis.grid(True)

plt.suptitle('Average Steps by Day of the Week', fontsize=16)
dow_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.xticks(index, dow_labels, rotation=45)
plt.xlabel('Day of Week', fontsize=12, color='red')

What about hours (bearing in mind time zones)?

In [None]:
hour_steps = steps.groupby(['hour'])['value'].sum().reset_index(name='Steps')

In [None]:
ax = hour_steps.Steps.plot(kind='line', figsize=[10, 5], linewidth=4, alpha=1, marker='o', color='#6684c1', 
                      markeredgecolor='#6684c1', markerfacecolor='w', markersize=8, markeredgewidth=2)

# label each point with its actual hour (not its row position), in case some hours are missing
xlabels = hour_steps['hour'].map(lambda x: '{:02}:00'.format(x))
ax.set_xticks(range(len(xlabels)))
ax.set_xticklabels(xlabels, rotation=45, rotation_mode='anchor', ha='right')

# ax.set_xlim((hour_steps.index[0], hour_steps.index[-1]))

ax.yaxis.grid(True)
# ax.set_ylim((0, 1300))
ax.set_ylabel('Steps')
ax.set_xlabel('')
ax.set_title('Steps by hour of the day')

plt.show()

Let's combine the numeric representation of my lived mobilities. What about flights?

In [None]:
flights = pd.read_csv("FlightsClimbed.csv") #use pandas (pd) to read the csv file
flights.head() #have a look at the first five rows of data

Let's parse it out again

In [None]:
# parse out date and time elements as LA time
flights['startDate'] = pd.to_datetime(flights['startDate'])
flights['year'] = flights['startDate'].map(get_year)
flights['month'] = flights['startDate'].map(get_month)
flights['date'] = flights['startDate'].map(get_date)
flights['day'] = flights['startDate'].map(get_day)
flights['hour'] = flights['startDate'].map(get_hour)
flights['dow'] = flights['startDate'].map(get_day_of_week)

And group it into dates

In [None]:
flights_by_date = flights.groupby(['date'])['value'].sum().reset_index(name='Flights')
flights_by_date.head()

And save...

In [None]:
flights_by_date.to_csv("flights_by_date.csv", index=False)

In [None]:
flights_by_date['date'] = pd.to_datetime(flights_by_date['date'])
flights_by_date['dow'] = flights_by_date['date'].dt.weekday

In [None]:
#plot

data = flights_by_date.groupby(['dow'])['Flights'].mean()

fig, ax = plt.subplots(figsize=[10, 6])
ax = data.plot(kind='bar', x='day_of_week')

n_groups = len(data)
index = np.arange(n_groups)
opacity = 0.75

#fig, ax = plt.subplots(figsize=[10, 6])
ax.yaxis.grid(True)

plt.suptitle('Average Flights by Day of the Week', fontsize=16)
dow_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.xticks(index, dow_labels, rotation=45)
plt.xlabel('Day of Week', fontsize=12, color='red')

# Attaching numbers to numbers

To be totally ridiculous, let's compare how many steps I take on each day of the week with how many comments are posted on your chosen subreddit.

First, we need to parse out (again) the time/date data. Then, it's just like above, using "groupby", while paying attention to the column headers.

In [None]:
post_data.head()

In [None]:
# parse out date and time elements as LA time
post_data['created'] = pd.to_datetime(post_data['created'])
post_data['year'] = post_data['created'].map(get_year)
post_data['month'] = post_data['created'].map(get_month)
post_data['date'] = post_data['created'].map(get_date)
post_data['day'] = post_data['created'].map(get_day)
post_data['hour'] = post_data['created'].map(get_hour)
post_data['dow'] = post_data['created'].map(get_day_of_week)

post_data.head()

In [None]:
# total Reddit comments and median daily steps for each day of the week (0 = Monday)
c_df = post_data.groupby(['dow'])['comms_num'].sum()
s_df = steps_by_date.groupby(['dow'])['Steps'].median()

# draw both lines on the same axes
fig, ax = plt.subplots(figsize=[10, 6])
c_df.plot(kind='line', ax=ax, label='Reddit comments')
s_df.plot(kind='line', ax=ax, label='Steps')

ax.yaxis.grid(True)
ax.legend()

plt.suptitle('Steps VS Reddit Comments', fontsize=16)
dow_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.xticks(range(7), dow_labels, rotation=45)
plt.xlabel('Day of Week', fontsize=12, color='green')
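
What makes this comparison possible is that both data sets now share the same kind of index (the day of the week). Here is a minimal sketch with made-up numbers (the `toy_` names are hypothetical) showing how pandas lines two series up on a shared index.

```python
import pandas as pd

# made-up totals per day of the week (0 = Monday, 1 = Tuesday, 2 = Wednesday)
toy_steps = pd.Series([4000, 5200, 4800], index=[0, 1, 2], name="Steps")
toy_comments = pd.Series([30, 45], index=[0, 2], name="Comments")

# line the two series up on their shared index; days missing from one become NaN
toy_combined = pd.concat([toy_steps, toy_comments], axis=1)
print(toy_combined)
```

Tuesday has no comment count in this toy data, so pandas fills the gap with NaN rather than guessing.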

Have a mess around in your own time - compare means to medians, and ask your friends in data science what it's all about, because honestly, it's just a strange kind of magic.

# Turning words into numbers

Turning back to our subreddit, and channelling cultural analytics, let's look a little more closely at some text analysis and see what we can do!

Text works as a `str` or string: 

![](https://csharpcorner.azureedge.net/article/learn-about-strings-in-python/Images/Capture.PNG)

A word is a string of individual letters, a sentence is a string of words!

(Strings are used a lot in the Digital Humanities and Text Processing - I'm a geographer, and still learning about strings, so bear with me!)

Let's start by grabbing a cell with an object from our `post_data` dataset. With a `pandas` data frame, everything works on a gridded position as well! You can use `iloc` (or location by position) to find particular cells. Let's start with the row at position 3 (the fourth row, because Python starts counting at 0):

In [None]:
post_data.iloc[3] 

Now, if you count down the list (starting from 0), body is number "7", so let's add that to get the cell.

In [None]:
post_data.iloc[3,7] 

Now, let's convert it from a pandas object to a string, and give it a name, so we can do some analysis:

In [None]:
cell = str(post_data.iloc[3,7])
print(cell)

We can count how many characters are in the string:

In [None]:
len(cell) #len = length

Or what the character at position 'n' in the string is (in the below example, position 45, which is the 46th character because we count from 0):

In [None]:
cell[45]
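
Here is the same idea as a minimal sketch on a short, made-up sentence, so that you can count the characters yourself and check the answers.

```python
toy_sentence = "Turning words into numbers"

# how many characters (spaces count too)
print(len(toy_sentence))

# the character at position 0 (the first one)
print(toy_sentence[0])

# a slice: positions 8 up to (but not including) 13
print(toy_sentence[8:13])

# split the string of characters into a list of words
print(toy_sentence.split())
```

Note that asking for a position beyond the end of a string raises an `IndexError`, so `cell[45]` above will only work if the post body is at least 46 characters long.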

### Counting words

If we wanted to be braver, we could even try to count the most common words across all the posts in the "title" column:

In [None]:
from collections import Counter
Counter(" ".join(post_data["title"]).split()).most_common(20)

So, there are many "to", "the", "of" .... These are called "stopwords". Let's create a new column with all the stopwords deleted so we can count again. 

To do this we import the stopwords corpus from `nltk`, which includes a list of common English stopwords. 


In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')


Then we delete the stopwords from the title column and make a new column without the stopwords.

In [None]:
# compare in lower case so that capitalised stopwords like "The" are removed too
post_data['title_without_stopwords'] = post_data['title'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop]))
post_data.head()

And try again...

In [None]:
Counter(" ".join(post_data["title_without_stopwords"]).split()).most_common(20)
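
To see the whole pipeline in miniature, here is a minimal sketch with three made-up titles and a tiny hand-written stopword list (both are hypothetical and stand in for the Reddit titles and the nltk list).

```python
from collections import Counter

# three made-up post titles
toy_titles = ["The best coffee near campus", "Where is the best library", "Coffee or the library"]

# a tiny hand-written stopword list, just for this example
toy_stop = {"the", "is", "or", "near", "where"}

# join the titles into one string, lower-case it, and split it into words
toy_words = " ".join(toy_titles).lower().split()

# keep only the words that are not stopwords, then count them
toy_kept = [w for w in toy_words if w not in toy_stop]
toy_counts = Counter(toy_kept)
print(toy_counts.most_common(3))
```

Lower-casing first means that "Coffee" and "coffee" are counted as the same word.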

Well done! 

(as a bonus, you could turn this into a data frame if you wanted, and plot it as well! - though it's not a super interesting graph!) 

In [None]:
from pandas import DataFrame
words_num = Counter(" ".join(post_data["title_without_stopwords"]).split()).most_common(20)
words_num_df = pd.DataFrame(words_num,columns=['word','count'])
fig = px.scatter(words_num_df, x="word", y="count", hover_name="word")
fig.show()

# Turning sound into numbers

Okay, let's try some data that we don't necessarily think of as numeric: sound.

Let's import some libraries to help us out with sound.

In [None]:
! pip install pydub
! pip install scipy

Now, let's import those libraries and read our file. We're directly referencing the `sound_sample.wav` that is in your downloaded folder. And let's print the rate and the audio data.

In [None]:
#required libraries
import scipy.io.wavfile
import pydub

rate,audData=scipy.io.wavfile.read("sound_sample.wav")

print(rate)
print(audData)

The outputs from `wavfile.read` are the sampling rate of the track and the audio wave data. The sampling rate represents the number of data points sampled per second in the audio file. In this case 44100 pieces of information per second make up the audio wave. This is a very common rate. The higher the rate, the better the audio quality.

Let's use the shape of the audio data to work out how long the clip is, in seconds!

In [None]:
#wav length
audData.shape[0] / rate
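
To make the link between sampling rate and length concrete, here is a minimal sketch that builds its own two-second tone out of numbers with `numpy`, instead of reading a file. The `toy_` names and the 440 Hz pitch are choices made for this example.

```python
import numpy as np

toy_rate = 44100      # samples per second
toy_seconds = 2       # how long we want the tone to be

# one time value (in seconds) for every sample
toy_t = np.arange(toy_rate * toy_seconds) / toy_rate

# a 440 Hz sine wave at half volume, stored as 16-bit integers like a typical wav file
toy_tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * toy_t)).astype(np.int16)

print(toy_tone.shape)                 # one array, so mono
print(toy_tone.shape[0] / toy_rate)   # length in seconds
print(toy_tone.dtype)
```

Dividing the number of samples by the rate gives back the two seconds we asked for.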

Looking at the shape of the audio data, it has ONE array, so it's a mono channel.

In [None]:
audData.dtype

The data is stored as int16. This is the size of the data stored in each datapoint (16 bits). Common storage formats are 8, 16 and 32 bits. Again, the higher this is, the better the audio quality.

The values in the data represent the amplitude of the wave (or the loudness of the audio). The energy of the audio can be described by the sum of the squared amplitudes.

In [None]:
#Energy of music
np.sum(audData.astype(float)**2)

This will depend on the length of the audio, the sample rate and the volume of the audio. A better metric is power, which is energy per second...

In [None]:
#power - energy per unit of time (the energy divided by the clip length in seconds)
np.sum(audData.astype(float)**2) / (audData.shape[0] / rate)

Now, let's plot the amplitude of the track over time...

In [None]:
import matplotlib.pyplot as plt

#create a time variable in seconds
time = np.arange(0, float(audData.shape[0]), 1) / rate

#plot amplitude (or loudness) over time
plt.figure(1)
plt.subplot(211)
plt.plot(time, audData, linewidth=0.01, alpha=1, color='#00ff00')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

Another common way to analyse audio is to create a spectrogram. Audio spectrograms are heat maps that show the frequencies of the sound in Hertz (Hz) and the volume of the sound in Decibels (dB), against time.

In [None]:
plt.figure(2, figsize=(8,6))
plt.subplot(211)
Pxx, freqs, bins, im = plt.specgram(audData, Fs=rate, NFFT=1024, cmap=plt.get_cmap('viridis'))
cbar=plt.colorbar(im)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
cbar.set_label('Intensity dB')
plt.show()
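
Under the hood, a spectrogram splits the audio into short windows and asks of each one: "which frequencies are in here?" Here is a minimal sketch of that question asked once, over a whole made-up one-second tone, using numpy's FFT functions. This is a simplification for illustration, not what `specgram` does step by step, and the `toy_` names are hypothetical.

```python
import numpy as np

toy_rate = 8000                               # samples per second
toy_t = np.arange(toy_rate) / toy_rate        # one second of time values
toy_tone = np.sin(2 * np.pi * 440 * toy_t)    # a pure 440 Hz tone

# how strongly each frequency is present in the tone
toy_spectrum = np.abs(np.fft.rfft(toy_tone))

# the frequency (in Hz) that each entry of the spectrum corresponds to
toy_freqs = np.fft.rfftfreq(toy_tone.size, d=1 / toy_rate)

# the frequency with the strongest presence
toy_peak = toy_freqs[np.argmax(toy_spectrum)]
print(toy_peak)
```

The strongest frequency should come out at (or extremely close to) the 440 Hz we put in.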

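The result allows us to pick out a certain frequency and examine it.

In [None]:
# find the frequency bin closest to 10034.47 Hz (about 10 kHz) and plot its intensity over time
idx = np.argmin(np.abs(freqs - 10034.47265625))
KHZ10 = Pxx[idx, :]
plt.plot(bins, KHZ10, color='#ff7f00')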

Okay, that's it for sound!

# Turning images into grids into numbers

In the final section of this studio, we're going to use a mixture of `matplotlib` and another library `imageio` to examine how images work as computational data (and how they're all also secretly grids and numbers). 

First, let's import `imageio` (`matplotlib` is already imported above), and drag in an image:

In [None]:
import imageio
#replace the link with the link to an image of your choice
pic = imageio.imread("https://media.npr.org/assets/img/2017/04/25/istock-115796521-fcf434f36d3d0865301cdcb9c996cfd80578ca99-s800-c85.jpg")
plt.figure(figsize = (15,15))

plt.imshow(pic)

All digital images look like this (thanks Stanford for the image): 

![](https://web.stanford.edu/class/cs101/image-diagram1.png)

Just like your graphs above, they have an `x` and `y` axis.

Each pixel is made up of three values: red (r), green (g) and blue (b):

![](https://web.stanford.edu/class/cs101/image-diagram2.png)

We will investigate this a little more in our image workshop, but for now, this provides us two ways of classifying (and so, searching through) the enormous data set that is an image: colour, and position. 

First, let's check that your image is in 3 dimensions (or RGB)

In [None]:
print('Dimension of Image {}'.format(pic.ndim))

Now, let's find the RGB value of a single pixel!

In [None]:
rgb = pic[100, 50]
print(rgb)
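
If you want to see the grid without a real photo getting in the way, here is a minimal sketch that builds a tiny made-up "image" of 2 rows by 3 columns out of zeros, and colours in two pixels by hand (the `toy_pic` name is hypothetical).

```python
import numpy as np

# a made-up 2 x 3 image: 2 rows, 3 columns, 3 colour values (r, g, b) per pixel
toy_pic = np.zeros((2, 3, 3), dtype="uint8")
toy_pic[0, 1] = [255, 0, 0]      # row 0, column 1 becomes pure red
toy_pic[1, 2] = [0, 0, 255]      # row 1, column 2 becomes pure blue

print(toy_pic.shape)       # (rows, columns, channels)
print(toy_pic[0, 1])       # the r, g, b values of one pixel
print(toy_pic[0, 1, 0])    # just the red value of that pixel
```

Note the order: the first number picks the row (the `y` position) and the second picks the column (the `x` position).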

Can we split the layers so each image just shows the red, green and blue values?

In [None]:
import numpy as np #thanks to Yassine Hamdaoui for the code

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
for c, ax in zip(range(3), axes):
    # create a zero matrix the same shape as the image
    # (dtype is set to uint8 because the numpy default is float64)
    split_img = np.zeros(pic.shape, dtype="uint8")
    # copy across just one channel (0 = red, 1 = green, 2 = blue)
    split_img[:, :, c] = pic[:, :, c]
    # display that channel
    ax.imshow(split_img)

What happens if we change the r value of the rows 50 to 150 to the full 255 intensity?

In [None]:
import matplotlib.pyplot as plt
    
pic[50:150 , : , 0] = 255 # full intensity to those pixel's R channel 
plt.figure( figsize = (5,5)) 
plt.imshow(pic) 
plt.show()

And finally, let's just highlight only pixel values that are higher than 180 in the r channel!

In [None]:
pic = imageio.imread("https://media.npr.org/assets/img/2017/04/25/istock-115796521-fcf434f36d3d0865301cdcb9c996cfd80578ca99-s800-c85.jpg")
red_mask = pic[:, :, 0] < 180
pic[red_mask] = 0
plt.figure(figsize=(5,5))
plt.imshow(pic)
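
The "mask" in that last cell is just a grid of True and False values, one per pixel. Here is a minimal sketch of the same trick on a made-up 2 x 2 image, which is small enough to print and read (the `toy_` names and pixel values are hypothetical).

```python
import numpy as np

# a made-up 2 x 2 image with two reddish pixels and two not-so-red pixels
toy_pic = np.array([[[200, 10, 10], [100, 50, 50]],
                    [[190, 0, 0], [20, 200, 200]]], dtype="uint8")

# True wherever the red value is below 180
toy_mask = toy_pic[:, :, 0] < 180
print(toy_mask)

# black out every pixel where the mask is True, leaving only the strongly red ones
toy_pic[toy_mask] = 0
print(toy_pic)
```

Only the two pixels with a red value above the threshold survive; everything else becomes black (0, 0, 0).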

That's it for today! Don't forget to post your graph or image in the #studios slack channel.