# Data Science Tools

In [None]:
# %load utils/imports.py
import pandas as pd
import cufflinks as cf
cf.go_offline()

# The Shape of Things to Come

You can even use magics to mix languages in a single notebook. For example, `rmagics` lets you run `R` code — including plotting — in a Python notebook. Note that you first need to load the `rpy2` extension.

In [None]:
!pip install --upgrade rpy2
!conda remove --force -y readline
!pip install --upgrade readline

In [None]:
import rpy2
%load_ext rpy2.ipython

In [None]:
%%R
x <- runif(10)
y <- runif(10)

plot(x, y)

As described in the rmagics documentation, you can use `%Rpush` and `%Rpull` to move values back and forth between `R` and `Python`:


In [None]:
lines = cf.datagen.lines(1).values
%Rpush lines

In [None]:
%R plot(lines)

In [None]:
pd.DataFrame(lines).iplot(mode='markers')

You can find other examples of language magics online, including [SQL magics](https://github.com/catherinedevlin/ipython-sql).

But what's the point? The point is that magics are handy on their own, but they really shine when you combine them. They help you build a pipeline in one visual flow by combining steps written in different languages. Getting familiar with magics lets you pick the most efficient solution for each subtask and bind the pieces together for your project.

When used this way, Jupyter notebooks become “visual shell scripts” tailored for data science work. Each cell can be a step in a pipeline that uses a high-level language directly (e.g., R, Python) or a lower-level shell command. At the same time, your “script” can also contain nicely formatted documentation and visual output from the steps in the process. It can even document its own performance, automatically recording CPU and memory utilization in its output.
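
As a rough illustration of a step that documents its own performance, here is a minimal sketch using only the standard library. The helper name `run_step` is hypothetical, and `tracemalloc` only sees memory allocated by Python itself, so treat the numbers as indicative rather than exact.

```python
import time
import tracemalloc


def run_step(name, func, *args, **kwargs):
    """Run one pipeline step and report wall time and peak Python memory."""
    tracemalloc.start()
    started = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - started
    # get_traced_memory() returns (current, peak) in bytes
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{name}: {elapsed:.3f}s, peak {peak / 1024:.1f} KiB")
    return result, elapsed, peak


# Example step: build a list of squares and record what it cost
squares, elapsed, peak = run_step("squares", lambda: [n * n for n in range(100_000)])
```

Inside a notebook, the built-in `%time` magic gives a quicker timing readout for a single statement; a helper like this is mainly useful when the measurements should end up in the saved output of every run.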

## Case Study: Batch, scheduling, and reports

Like any other Python script, a notebook can also be run in "batch mode". With `nbconvert`, you can execute an entire notebook non-interactively, saving the result in place or in a variety of other formats.

This capability makes notebooks a powerful tool for ETL and for reporting. For a report, schedule the notebook to run automatically on a recurring basis and update its contents or email its results to colleagues. Or, using the magics techniques described above, a notebook can implement a data pipeline or ETL task that also runs on an automatic schedule.
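
A batch run can also be started from Python. The sketch below only assembles and, optionally, launches the `jupyter nbconvert` command line; the helper names `nbconvert_command` and `run_notebook` and the file name `report.ipynb` are made up for illustration, and it assumes `jupyter` is on your `PATH`.

```python
import subprocess


def nbconvert_command(notebook_path, to="html", inplace=False):
    """Build the argument list for executing a notebook non-interactively."""
    command = ["jupyter", "nbconvert", "--execute", "--to", to]
    if inplace:
        # --inplace is only meaningful together with --to notebook
        command.append("--inplace")
    command.append(notebook_path)
    return command


def run_notebook(notebook_path, **options):
    """Execute the notebook; raises CalledProcessError if nbconvert exits with an error."""
    return subprocess.run(nbconvert_command(notebook_path, **options), check=True)


# Show the command without running it
print(nbconvert_command("report.ipynb"))
```

Calling `run_notebook("report.ipynb", to="notebook", inplace=True)` would execute the notebook and overwrite it with the fresh outputs.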

## Cronjob

[Cron jobs](https://help.ubuntu.com/community/CronHowto) are tasks scheduled to run periodically on a computer or server. They’re easy to set up in Ubuntu and can be tied to a Python script such that the cron job runs it automatically.

Cron jobs are perfect for collecting data from a MySQL database or API and updating a graph or dashboard of graphs with the newest data.

**Note for Mac and Windows users**: Mac operating systems support cron jobs. To edit the crontab scheduler on a Mac, type `env EDITOR=nano crontab -e` in the Terminal. Windows does not support cron jobs. The equivalent in Windows is the Windows Task Scheduler.

Open your *nix terminal and type `crontab -e`. You may be asked to choose an editor to edit your crontab. I'd recommend the editor “nano” because it is the easiest to navigate for beginners.

In your crontab file, press the down arrow key until you are at the bottom of the file. Then, add this line:

```cron
0,30 * * * * jupyter nbconvert --execute --to html /home/user/MyNotebook.ipynb >/dev/null 2>&1
```

Make sure that you change the path and file name so that they match the name and location of your notebook. In this example, the notebook is called `MyNotebook.ipynb` and is saved in `/home/user/`.

If you want to run it and save the results as an ipynb file, you can use `--to notebook`.

The `>/dev/null 2>&1` is optional. It discards the command's standard output and error messages, so cron does not send them to your server’s mail inbox. If your notebook prints anything while it runs, you’ll want this redirection.

The five fields (`0,30 * * * *`) correspond to minute, hour, day of month, month, and weekday, respectively. The above command tells cron to run the notebook at minutes 0 and 30 of every hour, that is, every 30 minutes, every day of the year. If you instead wanted to run your notebook only once per day, at `4:22am`, you would use this syntax:

```cron
22 04 * * * jupyter nbconvert --execute --to html /home/user/MyNotebook.ipynb >/dev/null 2>&1
```

Press <kbd>CTRL</kbd>+<kbd>x</kbd> to exit the crontab file. If you’re asked whether you want to save, make sure that you type `y`. Your notebook should now update every 30 minutes.

Google is your friend when figuring out more complex cron job syntax (e.g., “run my Python script every other day and every Sunday at 10:00pm”).

An example of more complex syntax is this ultra-simple webscraper:

```cron
*/20 * * * * /usr/bin/wget -O /home/m/scraper/hcom_$(date +\%F_\%T).html http://hotels.com/ > /dev/null 2>&1
```

which downloads the front page of hotels.com every 20 minutes and stores the HTML file with a timestamp!
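
To make the five-field syntax more concrete, here is a small sketch of a matcher that checks whether a given moment satisfies a cron expression. It is a simplification, not a reimplementation of cron: the function names are hypothetical, it only understands `*`, comma-separated values, and `*/n` steps (no ranges such as `1-5`, no names such as `mon`), and it requires all five fields to match.

```python
from datetime import datetime


def field_matches(field, value, minimum):
    """Return True if one cron field (e.g. '*', '0,30', '*/20', '04') accepts value."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # '*/n' means every n-th value, counted from the field's minimum
            if (value - minimum) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression, moment):
    """Check a datetime against 'minute hour day month weekday'."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        field_matches(minute, moment.minute, 0)
        and field_matches(hour, moment.hour, 0)
        and field_matches(day, moment.day, 1)
        and field_matches(month, moment.month, 1)
        and field_matches(weekday, cron_weekday, 0)
    )


half_past_ten = datetime(2024, 1, 1, 10, 30)
print(cron_matches("0,30 * * * *", half_past_ten))
print(cron_matches("22 04 * * *", half_past_ten))
```

The first check corresponds to the "every 30 minutes" schedule and the second to the "once per day at 4:22am" schedule from above.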

### Scheduled dashboard

Let’s say that you have to regularly send a folium map to your colleague’s email with all the earthquakes of the past day.

To be able to do that, you first need an earthquake data set that updates regularly (at least daily). A data feed that updates every 5 minutes can be found [here](http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php). Then, you can use Jupyter to write the code to load this data and create the map.

In [None]:
!pip install --upgrade folium

In [None]:
import pandas as pd
import folium
from matplotlib.colors import Normalize, rgb2hex
import matplotlib.cm as cm

In [None]:
from warnings import filterwarnings as fw
fw("ignore", category=FutureWarning)

data = pd.read_csv('http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv')
# Scale magnitudes to [0, 1] so they can be mapped onto a colormap
norm = Normalize(data['mag'].min(), data['mag'].max())

# The feed's timestamps are UTC, so compare them against a timezone-aware "now"
now = pd.Timestamp.now(tz='UTC')
data['hrs_ago'] = (now - pd.to_datetime(data['time'], utc=True)).dt.total_seconds() / 3600

# folium.Circle and Map.save replace circle_marker and create_map from older folium releases
quake_map = folium.Map(location=[22, 114], zoom_start=3)
for _, eq in data.iterrows():
    color = rgb2hex(cm.OrRd(norm(float(eq['mag']))))
    folium.Circle(
        location=[eq['latitude'], eq['longitude']],
        popup="{} | {:.0f} hours ago".format(eq['place'], eq['hrs_ago']),
        radius=20000 * float(eq['mag']),  # radius in meters
        color=color,
        fill=True,
        fill_color=color,
    ).add_to(quake_map)
quake_map.save('assets/earthquake.html')

# Replace the http:// Leaflet CDN URLs with protocol-relative ones so the map also loads over https
with open('assets/earthquake.html', 'r') as src:
    contents = src.read()
contents = contents.replace("http://cdn.leafletjs.com/leaflet-0.5/", "//cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.7/")
with open('assets/earthquake2.html', 'w') as dst:
    dst.write(contents)

In [None]:
%%HTML
<iframe width="100%" height="600" src="assets/earthquake2.html?inline=true"></iframe>
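
The remaining piece of the scenario is getting the map to your colleague. One possible approach, sketched here with the standard library only, is to attach the generated HTML file to an email. The function names, the addresses, and the SMTP host `localhost` are assumptions for illustration; your mail server will most likely need its own host name and authentication.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(map_html, sender, recipient):
    """Create a message with the rendered map attached as an HTML file."""
    message = EmailMessage()
    message["Subject"] = "Earthquakes of the past day"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("The latest earthquake map is attached.")
    # Passing a str attaches it as text; subtype="html" marks it as text/html
    message.add_attachment(map_html, subtype="html", filename="earthquake.html")
    return message


def send_report(message, host="localhost"):
    """Hand the message to an SMTP server (assumed to accept mail without login)."""
    with smtplib.SMTP(host) as server:
        server.send_message(message)


# Build a message from placeholder HTML; nothing is sent here
message = build_report_email(
    "<html><body>map</body></html>", "me@example.com", "colleague@example.com"
)
print(message["Subject"], [part.get_filename() for part in message.iter_attachments()])
```

In the scheduled notebook, you would read `assets/earthquake2.html` into a string, pass it to `build_report_email`, and then call `send_report` as the final cell.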