This repository has been archived by the owner. It is now read-only.
updated indian languages page
mjpost committed Aug 14, 2013
1 parent d0972c8 commit b60d6d7b81c403bce436cd4b31d0132e93cce35c
Showing 1 changed file with 243 additions and 133 deletions.
@@ -1,136 +1,246 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<link rel="stylesheet" type="text/css" media="screen,print" href="../joshua.css" />
<title>Indian Languages Parallel Corpora</title>
</head>

<body>

<div id="content">

<h1>Indian Parallel Corpora <span id="download"><a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">Download</a></span></h1>
<hr />

<h2>Description</h2>
<a name="Description"/>

This page describes a set of parallel corpora between English and six languages of the Indian
subcontinent:

<ul>
<li>Bengali</li>
<li>Hindi</li>
<li>Malayalam</li>
<li>Tamil</li>
<li>Telugu</li>
<li>Urdu</li>
</ul>

<p>
They can be used to train (and evaluate) models
for <a href="http://en.wikipedia.org/wiki/Statistical_machine_translation">automatically
translating</a> text into and out of these languages.
They were collected by translating Indian Wikipedia articles into English
using Amazon's Mechanical Turk. Their collection and release are described in the paper:
</p>

<blockquote>
<i>Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing</i> <br/>
<a href="http://cs.jhu.edu/~post">Matt Post</a>, <a href="http://cs.jhu.edu/~ccb">Chris
Callison-Burch</a>, and <a href="http://homepages.inf.ed.ac.uk/miles/">Miles Osborne</a> <br/>
<a href="http://statmt.org/wmt12">WMT 2012</a> <br/>
<a class="pdf" href="http://aclweb.org/anthology-new/W/W12/W12-3152.pdf">PDF</a>
<a class="bibtex" href="http://aclweb.org/anthology-new/W/W12/W12-3152.bib">BIB</a>
</blockquote>

<h2>Download &amp; License</h2>

The Indian parallel corpora dataset
is <a href="https://github.com/joshua-decoder/indian-parallel-corpora">hosted on GitHub</a>.
You can download a zip archive directly
by <a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">clicking
here</a>. The corpus is licensed under the <a href="http://creativecommons.org/">Creative
Commons</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/">Attribution-ShareAlike 3.0
Unported License</a> (CC BY-SA 3.0).

<h2>Citations</h2>

<p>
The following publications have made use of this dataset.
</p>

<ol>
<li><b>Post, Callison-Burch, and Osborne (2012)</b>. This paper introduced the parallel
corpora, describing how the data was collected, reporting the results of preliminary
experiments, and suggesting some potential research directions.</li>
</ol>

<h2>Scores</h2>

<p>
Below are the best translation scores (case-insensitive BLEU-4) that have been reported on the
provided test sets. The Google results were recorded in the fall of 2011 and are described
in Post et al. (2012). Google had no Malayalam system at that time, so no Google score is listed for Malayalam (ML).
</p>

<div>
<table id="results">
<tr>
<th style="width:150px">Citation</th>
<th>BN</th>
<th>HI</th>
<th>ML</th>
<th>TA</th>
<th>TE</th>
<th>UR</th>
</tr>
<tr>
<td colspan="7"><hr/></td>
</tr>
<tr>
<td class="system">Google</td>
<td>20.01</td>
<td>25.21</td>
<td>&ndash;</td>
<td>13.51</td>
<td>16.03</td>
<td>23.09</td>
</tr>
<tr>
<td class="system">Post et al. (2012)</td>
<td>13.53</td>
<td>17.29</td>
<td>13.72</td>
<td> 9.81</td>
<td>12.46</td>
<td>19.53</td>
</tr>
</table>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Indian Languages Parallel Corpora</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">

<!-- Le styles -->
<link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
#download {
background-color: green;
font-size: 14pt;
font-weight: bold;
text-align: center;
color: white;
border-radius: 5px;
padding: 4px;
}

#download a:link {
color: white;
}

#download a:hover {
color: lightgrey;
}

#download a:visited {
color: white;
}

a.pdf {
font-variant: small-caps;
/* font-weight: bold; */
font-size: 10pt;
color: white;
background: brown;
padding: 2px;
}

a.bibtex {
font-variant: small-caps;
/* font-weight: bold; */
font-size: 10pt;
color: white;
background: orange;
padding: 2px;
}

img.sponsor {
height: 120px;
margin: 5px;
}
</style>
<link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">

<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="bootstrap/js/html5shiv.js"></script>
<![endif]-->

<!-- Fav and touch icons -->
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
<link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
<link rel="shortcut icon" href="bootstrap/ico/favicon.png">
</head>

<body>

<div class="navbar navbar-inverse navbar-fixed-top">
<div class="navbar-inner">
<div class="container">
<button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="brand" href="#">Joshua</a>
<div class="nav-collapse collapse">
<ul class="nav">
<li class="active"><a href="/">Home</a></li>
<li><a href="index.html">Indian Languages</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</div>
</div>
</div>
<div style="width: 250px; margin-top: 100px">
<img width="250px" src="images/map1.png"/>
<p style="clear: both; text-align: center"><b>Indo-Aryan languages</b></p>

<img width="250px" src="images/map2.png"/>
<p style="clear: both; text-align: center"><b>Dravidian languages</b></p>
</div>

</body>
<div class="container">

<div class="row">
<div class="span8">
<h1>Indian Languages Parallel Corpora</h1>
</div>
<div>
<p>
<br/>
<span id="download">
<a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">Download</a>
</span>
</p>
</div>
</div>

<hr />

<div class="row">
<div class="span8">

<h2>Description</h2>

This page describes a set of parallel corpora between English and six languages of the
Indian subcontinent:

<ul>
<li>Bengali</li>
<li>Hindi</li>
<li>Malayalam</li>
<li>Tamil</li>
<li>Telugu</li>
<li>Urdu</li>
</ul>

<p>
They can be used to train (and evaluate) models
for <a href="http://en.wikipedia.org/wiki/Statistical_machine_translation">automatically
translating</a> text into and out of these languages. They were collected by
translating Indian Wikipedia articles into English using Amazon's Mechanical Turk.
Their collection and release are described in the paper:
</p>

<blockquote>
<i>Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing</i> <br/>
<a href="http://cs.jhu.edu/~post">Matt Post</a>, <a href="http://cs.jhu.edu/~ccb">Chris
Callison-Burch</a>, and <a href="http://homepages.inf.ed.ac.uk/miles/">Miles
Osborne</a> <br/>
<a href="http://statmt.org/wmt12">WMT 2012</a> <br/>
<a class="pdf" href="http://aclweb.org/anthology-new/W/W12/W12-3152.pdf">PDF</a>
<a class="bibtex" href="http://aclweb.org/anthology-new/W/W12/W12-3152.bib">BIB</a>
</blockquote>

<h2>Download &amp; License</h2>

The Indian parallel corpora dataset
is <a href="https://github.com/joshua-decoder/indian-parallel-corpora">hosted on
GitHub</a>. You can download a zip archive directly
by <a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">clicking
here</a>. The corpus is licensed under the <a href="http://creativecommons.org/">Creative
Commons</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/">Attribution-ShareAlike
3.0 Unported License</a> (CC BY-SA 3.0).

<h2>Citations</h2>

<p>
The following publications have made use of this dataset.
</p>

<ul>
<li><b>Post, Callison-Burch, and Osborne (2012)</b>. This paper introduced the parallel
corpora, describing how the data was collected, reporting the results of preliminary
experiments, and suggesting some potential research directions.
</li>
</ul>

<h2>Scores</h2>

<p>
Below are the best translation scores (case-insensitive BLEU-4) that have been
reported on the provided test sets. The Google results were recorded in the fall of
2011 and are described in Post et al. (2012). Google had no Malayalam system at that
time, so no Google score is listed for Malayalam (ML).
</p>

<div>
<table>
<tr>
<th style="width:150px">Citation</th>
<th>BN</th>
<th>HI</th>
<th>ML</th>
<th>TA</th>
<th>TE</th>
<th>UR</th>
</tr>
<tr>
<td class="system">Google</td>
<td>20.01</td>
<td>25.21</td>
<td>&ndash;</td>
<td>13.51</td>
<td>16.03</td>
<td>23.09</td>
</tr>
<tr>
<td class="system">Post et al. (2012)</td>
<td>13.53</td>
<td>17.29</td>
<td>13.72</td>
<td> 9.81</td>
<td>12.46</td>
<td>19.53</td>
</tr>
</table>
</div>
</div>

<div class="span4">
<div>
<img width="250px" src="images/map1.png"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Indo-Aryan_languages">Indo-Aryan languages</a></p>

<img width="250px" src="images/map2.png"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Dravidian_languages">Dravidian languages</a></p>
</div>
</div>
</div>
</div> <!-- /container -->

<!-- Le javascript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="bootstrap/js/jquery.js"></script>
<script src="bootstrap/js/bootstrap-transition.js"></script>
<script src="bootstrap/js/bootstrap-alert.js"></script>
<script src="bootstrap/js/bootstrap-modal.js"></script>
<script src="bootstrap/js/bootstrap-dropdown.js"></script>
<script src="bootstrap/js/bootstrap-scrollspy.js"></script>
<script src="bootstrap/js/bootstrap-tab.js"></script>
<script src="bootstrap/js/bootstrap-tooltip.js"></script>
<script src="bootstrap/js/bootstrap-popover.js"></script>
<script src="bootstrap/js/bootstrap-button.js"></script>
<script src="bootstrap/js/bootstrap-collapse.js"></script>
<script src="bootstrap/js/bootstrap-carousel.js"></script>
<script src="bootstrap/js/bootstrap-typeahead.js"></script>

</body>
</html>
